Preface to the Second Edition



‘What do you plan to do next?’ de Wetherset asked, picking up a piece of 
vellum covered with minute writing and studying it. Bartholomew rose to 
leave. The Chancellor clearly was not interested in how they went about 
getting the information, only in what they discovered. 

— Susanna Gregory, An Unholy Alliance 



It is a tribute to our profession that a textbook that was current in 1999 is
starting to feel old. The work for the first edition of Monte Carlo Statistical
Methods (MCSM1) was finished in late 1998, and the advances made since
then, as well as our level of understanding of Monte Carlo methods, have
grown a great deal. Moreover, two other things have happened. First, topics
that just made it into MCSM1 with the briefest treatment (for example, perfect
sampling) have now attained a level of importance that necessitates a much
more thorough treatment. Second, some other methods have not withstood
the test of time or, perhaps, have not yet been fully developed, and now receive
a more appropriate treatment.

When we worked on MCSM1 in the mid-to-late 90s, MCMC algorithms
were already heavily used, and the flow of publications on this topic was at
such a high level that the picture was not only rapidly changing, but also
necessarily incomplete. Thus, the process that we followed in MCSM1 was
that of someone who was thrown into the ocean and was trying to grab onto 
the biggest and most seemingly useful objects while trying to separate the 
flotsam from the jetsam. Nonetheless, we also felt that the fundamentals of 
many of these algorithms were clear enough to be covered at the textbook 
level, so we swam on. 

In this revision, written five years later, we have the luxury of a more 
relaxed perspective on the topic, given that the flurry of activity in this area 
has slowed somewhat into a steady state. This is not to say that there is no 
longer any research in this area, but simply that the tree of publications, which 
was growing in every direction in 1998, can now be pruned, with emphasis 
being put on the major branches. For this new edition, we thus spent a good bit
of time attempting to arrange the material (especially in the first ten chapters) 
to be presented in a coherent, flowing story, with emphasis on fundamental 
principles. In doing so the “Fundamental Theorem of Simulation” emerged, 
which we now see as a basis for many Monte Carlo algorithms (as developed 
in Chapters 3 and 8). 

As a consequence of this coming-of-age of MCMC methods, some of the 
original parts of MCSM1 have therefore been expanded, and others shrunk.
For example, reversible jump, sequential MC methods, two-stage Gibbs and 
perfect sampling now have their own chapters. Also, we now put less emphasis 
on some of the finer details of convergence theory, because of the simultane- 
ous publication of Roberts and Tweedie (2004), which covers the theory of 
MCMC, with a comprehensive treatment of convergence results, and in gen- 
eral provides a deeper entry to the theory of MCMC algorithms. 

We also spend less time on convergence control, because some of the methods
presented in MCSM1 did not stand the test of time. The methods we
preserved in Chapter 12 have been sufficiently tested to be considered reli- 
able. Finally, we no longer have a separate chapter on missing (or latent) 
variables. While these models are usually a case study ideal for assessing sim- 
ulation methods, they do not enjoy enough of a unified structure to be kept 
as a separate chapter. Instead, we dispatched most of the remaining models 
to different chapters of this edition. 

From a (more) pedagogical point of view, we revised the book towards more 
accessibility and readability, thus removing the most theoretical examples and 
discussions. We also broke up the previously long chapters on Monte Carlo 
integration and Gibbs sampling into more readable chapters, with increasing 
coverage and difficulty. For instance, Gibbs sampling is first introduced via 
the slice sampler, which is simpler to describe and analyze, then the two- 
stage Gibbs sampler (or Data Augmentation) is presented on its own and 
only then do we launch into processing the general Gibbs sampler. Similarly, 
the experience of the previous edition led us to remove several problems or 
examples in every chapter, to include more detailed examples and fundamental 
problems, and to improve the help level on others. 

Throughout the preparation of this book, and of its predecessors, we 
were fortunate to have colleagues who provided help. George Fishman, Anne 
Philippe, Judith Rousseau, as well as numerous readers, pointed out typos 
and mistakes in the previous version. We are especially grateful to Christophe
Andrieu, Roberto Casarin, Nicolas Chopin, Arnaud Doucet, Jim Robert, Merrilee
Hurn, Jean-Michel Marin, Jesper Møller, François Perron, Arafat Tayeb,
and an anonymous referee, for detailed reading of parts (or the whole) of the
manuscript of this second version. Obviously, we, rather than they, should
be held responsible for any imperfection remaining in the current edition! We
also gained very helpful feedback from the audiences of our lectures on MCMC 
methods, especially during the summer schools of Luminy (France) in 2001 
and 2002; Les Diablerets (Switzerland) and Venezia (Italy) in 2003; Orlando
in 2002 and 2003; Atlanta in 2003; and Oulanka (Finland) in 2004. 

Thanks to Elias Moreno for providing a retreat in Granada for the launching
of this project in November 2002 (almost) from the top of Mulhacén; to
Manuella Delbois who made the move to BibTeX possible by translating the
entire reference list of MCSM1 into BibTeX format; to Jeff Gill for his patient
answers to our R questions, and to Olivier Cappé for never-ending Linux and
LaTeX support, besides his more statistical (but equally helpful!) comments
and suggestions.



Christian P. Robert 
George Casella 

June 15, 2004 
He sat, continuing to look down the nave, when suddenly the solution to
the problem just seemed to present itself. It was so simple, so obvious he 
just started to laugh... 

— P.C. Doherty, Satan in St Mary’s 



Monte Carlo statistical methods, particularly those based on Markov chains, 
have now matured to be part of the standard set of techniques used by statis- 
ticians. This book is intended to bring these techniques into the classroom, 
being (we hope) a self-contained logical development of the subject, with 
all concepts being explained in detail, and all theorems, etc. having detailed 
proofs. There is also an abundance of examples and problems, relating the 
concepts with statistical practice and enhancing primarily the application of 
simulation techniques to statistical problems of various difficulties. 

This is a textbook intended for a second-year graduate course. We do 
not assume that the reader has any familiarity with Monte Carlo techniques 
(such as random variable generation) or with any Markov chain theory. We do 
assume that the reader has had a first course in statistical theory at the level 
of Statistical Inference by Casella and Berger (1990). Unfortunately, a few 
times throughout the book a somewhat more advanced notion is needed. We 
have kept these incidents to a minimum and have posted warnings when they 
occur. While this is a book on simulation, whose actual implementation must 
be processed through a computer, no requirement is made on programming 
skills or computing abilities: algorithms are presented in a program-like format 
but in plain text rather than in a specific programming language. (Most of 
the examples in the book were actually implemented in C, with the S-Plus 
graphical interface.) 

Chapters 1-3 are introductory. Chapter 1 is a review of various statistical 
methodologies and of corresponding computational problems. Chapters 2 and 
3 contain the basics of random variable generation and Monte Carlo integra- 
tion. Chapter 4, which is certainly the most theoretical in the book, is an 
introduction to Markov chains, covering enough theory to allow the reader
to understand the workings and evaluate the performance of Markov chain 
Monte Carlo (MCMC) methods. Section 4.1 is provided for the reader who 
already is familiar with Markov chains, but needs a refresher, especially in the 
application of Markov chain theory to Monte Carlo calculations. Chapter 5 
covers optimization and provides the first application of Markov chains to sim- 
ulation methods. Chapters 6 and 7 cover the heart of MCMC methodology, 
the Metropolis-Hastings algorithm and the Gibbs sampler. Finally, Chap- 
ter 8 presents the state-of-the-art methods for monitoring convergence of the 
MCMC methods and Chapter 9 shows how these methods apply to some sta- 
tistical settings which cannot be processed otherwise, namely the missing data 
models. 

Each chapter concludes with a section of notes that serve to enhance the 
discussion in the chapters, describe alternate or more advanced methods, and 
point the reader to further work that has been done, as well as to current 
research trends in the area. The level and rigor of the notes are variable, with 
some of the material being advanced. 

The book can be used at several levels and can be presented in several ways. 
For example, Chapters 1-3 and most of Chapter 5 cover standard simulation
theory, and hence serve as a basic introduction to this topic. Chapters 6-9 are 
totally concerned with MCMC methodology. A one-semester course, assum- 
ing no familiarity with random variable generation or Markov chain theory 
could be based on Chapters 1-7, with some illustrations from Chapters 8 and 
9. For instance, after a quick introduction with examples from Chapter 1 or 
Section 3.1, and a description of Accept-Reject techniques of Section 2.3, the 
course could cover Monte Carlo integration (Section 3.2, Section 3.3 [except 
Section 3.3.3], Section 3.4, Section 3.7), Markov chain theory through either 
Section 4.1 or Section 4.2-Section 4.8 (while adapting the depth to the mathe- 
matical level of the audience), mention stochastic optimization via Section 5.3, 
and describe Metropolis-Hastings and Gibbs algorithms as in Chapters 6 and 
7 (except Section 6.5, Section 7.1.5, and Section 7.2.4). Depending on the time 
left, the course could conclude with some diagnostic methods of Chapter 8 (for 
instance, those implemented in CODA) and/or some models of Chapter 9 (for 
instance, the mixture models of Section 9.3 and Section 9.4). Alternatively, 
a more advanced audience could cover Chapter 4 and Chapters 6-9 in one 
semester and have a thorough introduction to MCMC theory and methods. 

Much of the material in this book had its original incarnation as the French 
monograph Methodes de Monte Carlo par Chaines de Markov by Christian 
Robert (Paris: Economica, 1996), which has been tested for several years on 
graduate audiences (in France, Quebec, and even Norway). Nonetheless, it
constitutes a major revision of the French text, with the inclusion of prob- 
lems, notes, and the updating of current techniques, to keep up with the ad- 
vances that took place in the past two years (like Langevin diffusions, perfect 
sampling, and various types of monitoring). 
Throughout the preparation of this book, and its predecessor, we were fortunate
to have colleagues who provided help. Sometimes this was in the form
of conversations or references (thanks to Steve Brooks and Sid Chib!), and 
a few people actually agreed to read through the manuscript. Our colleague 
and friend, Costas Goutis, provided many helpful comments and criticisms, 
mostly on the French version, but these are still felt in this version. We are 
also grateful to Brad Carlin, Dan Fink, Jim Robert, Galin Jones and Kris- 
hanu Maulik for detailed reading of parts of the manuscript, to our historian 
Walter Piegorsch, and to Richard Tweedie, who taught from the manuscript 
and provided many helpful suggestions, and to his students, Nicole Benton, 
Sarah Streett, Sue Taylor, Sandy Thompson, and Alex Trindade. Richard, 
whose influence on the field was considerable, both from a theoretical and a 
methodological point of view, most sadly passed away last July. His spirit, his 
humor and his brightness will remain with us for ever, as a recurrent process. 
Christophe Andrieu, Virginie Braïdo, Jean-Jacques Colleau, Randall Douc,
Arnaud Doucet, George Fishman, Jean-Louis Foulley, Arthur Gretton, Ana 
Justel, Anne Philippe, Sandrine Micaleff, and Judith Rousseau pointed out 
typos and mistakes in either the French or the English versions (or both), but 
should not be held responsible for those remaining! Part of Chapter 8 has a 
lot in common with a “reviewww” written by Christian Robert with Chantal 
Guihenneuc-Jouyaux and Kerrie Mengersen for the Valencia Bayesian meet- 
ing (and the Internet!). The input of the French working group “MC Cube,” 
whose focus is on convergence diagnostics, can also be felt in several places of 
this book. Wally Gilks and David Spiegelhalter granted us permission to use 
their graph (Figure 2.5) and examples as Problems 10.29-10.36, for which we 
are grateful. Agostino Nobile kindly provided the data on which Figures 10.4 
and 10.4 are based. Finally, Arnoldo Frigessi (from Roma) made the daring 
move of teaching (in English) from the French version in Oslo, Norway; not 
only providing us with very helpful feedback but also contributing to making 
Europe more of a reality! 



Christian P. Robert 
George Casella 

January 2002 
Introduction



There must be, he thought, some key, some crack in this mystery he could 
use to achieve an answer. 

— P.C. Doherty, Crown in Darkness 



Until the advent of powerful and accessible computing methods, the experi- 
menter was often confronted with a difficult choice. Either describe an accu- 
rate model of a phenomenon, which would usually preclude the computation 
of explicit answers, or choose a standard model which would allow this com- 
putation, but may not be a close representation of a realistic model. This 
dilemma is present in many branches of statistical applications, for example, 
in electrical engineering, aeronautics, biology, networks, and astronomy. To 
use realistic models, the researchers in these disciplines have often developed 
original approaches for model fitting that are customized for their own prob- 
lems. (This is particularly true of physicists, the originators of Markov chain 
Monte Carlo methods.) Traditional methods of analysis, such as the usual 
numerical analysis techniques, are not well adapted for such settings. 

In this introductory chapter, we examine some of the statistical models 
and procedures that contributed to the development of simulation-based in- 
ference. The first section of this chapter looks at some statistical models, 
and the remaining sections examine different statistical methods. Throughout 
these sections, we describe many of the computational difficulties associated 
with the methods. The final section of the chapter contains a discussion of 
deterministic numerical analysis techniques. 



1.1 Statistical Models 

In a purely statistical setup, computational difficulties occur at both the level 
of probabilistic modeling of the inferred phenomenon and at the level of
statistical inference on this model (estimation, prediction, tests, variable
selection, etc.). In the first case, a detailed representation of the causes
of the phenomenon, such as accounting for potential explanatory variables
linked to the phenomenon, can lead to a probabilistic structure that is too
complex to allow for a parametric representation of the model. Moreover, there
may be no provision for getting closed-form estimates of quantities of
interest. One setup with this type of complexity is expert systems (in
medicine, physics, finance, etc.) or, more generally, graph structures. See
Pearl (1988), Robert (1991),^ Spiegelhalter et al. (1993), Lauritzen (1996)
for examples of complex expert systems.^

Another situation where model complexity prohibits an explicit represen- 
tation appears in econometrics (and in other areas) for structures of latent (or 
missing) variable models. Given a “simple” model, aggregation or removal of 
some components of this model may sometimes produce such involved struc- 
tures that simulation is really the only way to draw an inference. In these 
situations, an often used method for estimation is the EM algorithm (Demp- 
ster et al. 1977), which is described in Chapter 3. In the following example, we 
illustrate a common missing data situation. The concept and use of missing
data techniques, and in particular of the two following examples, will recur
throughout the book.

Example 1.1. Censored data models. Censored data models are missing 
data models where densities are not sampled directly. To obtain estimates and 
make inferences in such models usually requires involved computations and 
precludes analytical answers. 

In a typical simple statistical model, we would observe random variables^
(rv's) $Y_1, \ldots, Y_n$, drawn independently from a population with
distribution $f(y|\theta)$. The distribution of the sample would then be given
by the product $\prod_{i=1}^n f(y_i|\theta)$. Inference about $\theta$ would be
based on this distribution.

In many studies, particularly in medical statistics, we have to deal with
censored random variables; that is, rather than observing $Y_i$, we may observe
$\min\{Y_i, u\}$, where $u$ is a constant. For example, if $Y_i$ is the survival
time of a patient receiving a particular treatment and $u$ is the length of the
study being done (say $u = 5$ years), then if the patient survives longer than
5 years, we do not observe the survival time, but rather the censored value
$u$. This modification leads to a more difficult evaluation of the sample
density.

Barring cases where the censoring phenomenon can be ignored, several
types of censoring can be categorized by their relation with an underlying
(unobserved) model, $Y_i \sim f(y_i|\theta)$:

^ Claudine, not Christian! 

^ Section 10.6.4 also gives a brief introduction to graphical models in connection 
with Gibbs sampling. 

^ Throughout the book we will use uppercase Roman letters for random variables 
and lowercase Roman letters for their realized values. Thus, we would observe 
X = x, where the random variable X produces the observation x. (For esthetic
purposes this distinction is sometimes lost with Greek letter random variables.) 
(i) Given random variables $Y_i$, which are, for instance, times of observation
or concentrations, the actual observations are $Y_i^* = \min\{Y_i, u\}$, where
$u$ is the maximal observation duration, the smallest measurable
concentration rate, or some other truncation point.

(ii) The original variables $Y_i$ are kept in the sample with probability
$p(y_i)$ and the number of censored variables is either known or unknown.

(iii) The variables $Y_i$ are associated with auxiliary variables $X_i \sim g$
such that $y_i^* = h(y_i, x_i)$ is the observation. Typically,
$h(y_i, x_i) = \min(y_i, x_i)$. The fact that truncation occurred, namely the
variable $\mathbb{I}_{Y_i > X_i}$, may be either known or unknown.



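As a particular example, if

$$
X \sim \mathcal{N}(\theta, \sigma^2) \quad\text{and}\quad Y \sim \mathcal{N}(\mu, \rho^2),
$$

the variable $Z = X \wedge Y = \min(X, Y)$ is distributed as

$$
\left[1 - \Phi\left(\frac{z-\theta}{\sigma}\right)\right] \rho^{-1}\,
\varphi\left(\frac{z-\mu}{\rho}\right)
+ \left[1 - \Phi\left(\frac{z-\mu}{\rho}\right)\right] \sigma^{-1}\,
\varphi\left(\frac{z-\theta}{\sigma}\right),
\tag{1.1}
$$

where $\varphi$ is the density of the normal $\mathcal{N}(0, 1)$ distribution
and $\Phi$ is the corresponding cdf, which is not easy to compute.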

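Similarly, if $X$ has a Weibull distribution with two parameters,
$\mathcal{W}e(\alpha, \beta)$, and density

$$
f(x) = \alpha\beta x^{\alpha-1} \exp(-\beta x^{\alpha})
$$

on $\mathbb{R}_+$, the observation of the censored variable $Z = X \wedge \omega$,
where $\omega$ is constant, has the density

$$
f(z) = \alpha\beta z^{\alpha-1} e^{-\beta z^{\alpha}}\, \mathbb{I}_{z \le \omega}
+ \left( \int_{\omega}^{\infty} \alpha\beta x^{\alpha-1} e^{-\beta x^{\alpha}}\, dx \right)
\delta_{\omega}(z),
\tag{1.2}
$$

where $\delta_a(\cdot)$ is the Dirac mass at $a$. In this case, the weight of the
Dirac mass, $P(X > \omega)$, can be explicitly computed (Problem 1.4).

The distributions (1.1) and (1.2) appear naturally in quality control
applications. There, testing of a product may be of a duration $\omega$, where the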
quantity of interest is time to failure. If the product is still functioning at the 
end of the experiment, the observation on failure time is censored. Similarly, 
in a longitudinal study of a disease, some patients may leave the study either 
due to other causes of death or by simply being lost to follow-up. || 
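
As a minimal sketch of how the censored Weibull density (1.2) enters a
likelihood, the following Python fragment evaluates the weight of the Dirac
mass, $P(X > \omega) = \exp(-\beta\omega^{\alpha})$, and the log-likelihood of a
censored sample. The function names and the simulation by inversion are
illustrative assumptions and are not taken from the book.

```python
import math
import random


def weibull_survival(omega, alpha, beta):
    """P(X > omega) for the density alpha*beta*x**(alpha-1)*exp(-beta*x**alpha)."""
    return math.exp(-beta * omega ** alpha)


def censored_weibull_loglik(z, omega, alpha, beta):
    """Log-likelihood of observations z_i = min(x_i, omega), following (1.2)."""
    total = 0.0
    for zi in z:
        if zi < omega:
            # Uncensored observation: log-density at zi.
            total += (math.log(alpha * beta) + (alpha - 1.0) * math.log(zi)
                      - beta * zi ** alpha)
        else:
            # Censored observation: log of the Dirac weight P(X > omega).
            total += -beta * omega ** alpha
    return total


def simulate_censored_weibull(n, omega, alpha, beta, rng):
    """Draw x by inverting the survival function, then censor at omega."""
    draws = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
        x = (-math.log(u) / beta) ** (1.0 / alpha)
        draws.append(min(x, omega))
    return draws
```

With a large simulated sample, the fraction of observations equal to $\omega$
should be close to `weibull_survival(omega, alpha, beta)`.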



In some cases, the additive form of a density, while formally explicit,
prohibits the computation of the density of a sample $(X_1, \ldots, X_n)$ for
$n$ large. (Here, "explicit" has the restrictive meaning that "it can be
computed in a reasonable time.")
Example 1.2. Mixture models. Models of mixtures of distributions are
based on the assumption that the observations $X_i$ are generated from one of
$k$ elementary distributions $f_j$ with probability $p_j$, the overall density
being

$$
X \sim p_1 f_1(x) + \cdots + p_k f_k(x).
\tag{1.3}
$$

If we observe a sample of independent random variables $(X_1, \ldots, X_n)$, the
sample density is

$$
\prod_{i=1}^{n} \left\{ p_1 f_1(x_i) + \cdots + p_k f_k(x_i) \right\}.
$$

When $f_j(x) = f(x|\theta_j)$, the evaluation of the likelihood at a given value
of $(\theta_1, \ldots, \theta_k, p_1, \ldots, p_k)$ only requires on the order^ of
$O(kn)$ computations, but we will see later that likelihood and Bayesian
inferences both require the expansion of the above product, which involves
$O(k^n)$ computations, and is thus prohibitive to compute in large samples.^

While the computation of standard moments like the mean or the variance 
of these distributions is feasible in many setups (and thus so is the derivation 
of moment estimators, see Problem 1.6), the representation of the likelihood 
function (and therefore the analytical computation of maximum likelihood or 
Bayes estimates) is generally impossible for mixtures. || 
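
The contrast between the two operation counts can be sketched as follows,
assuming normal components purely for illustration (the example itself does not
fix the family $f(x|\theta_j)$). Evaluating the log-likelihood at a given
parameter value is a double loop over the $n$ observations and the $k$
components, while expanding the product would create $k^n$ terms. All function
names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mixture_loglik(xs, weights, means, sds):
    """Log of the sample density of a k-component normal mixture: O(kn) work."""
    total = 0.0
    for x in xs:
        # Inner sum over the k components for one observation.
        dens = sum(p * normal_pdf(x, m, s) for p, m, s in zip(weights, means, sds))
        total += math.log(dens)
    return total


def expanded_term_count(k, n):
    """Number of terms obtained if the product of n k-term sums is expanded."""
    return k ** n
```

For instance, `expanded_term_count(2, 100)` already exceeds $10^{30}$, whereas
`mixture_loglik` on the same sample needs only 200 density evaluations. This
sketch does not guard against numerical underflow of the component densities.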

Finally, we look at a particularly important example in the processing of 
temporal (or time series) data where the likelihood cannot be written explic- 
itly. 

Example 1.3. Moving average model. An MA($q$) model describes variables
$(X_t)$ that can be modeled as ($t = 0, \ldots, n$)

$$
X_t = \varepsilon_t + \sum_{j=1}^{q} \beta_j \varepsilon_{t-j},
\tag{1.4}
$$

where for $i = -q, -(q-1), \ldots$, the $\varepsilon_i$'s are iid random
variables $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ and for $j = 1, \ldots, q$,
the $\beta_j$'s are unknown parameters. If the sample consists of the
observation $(X_0, \ldots, X_n)$, where $n > q$, the sample density is (Problem
1.5)

^ Recall that the notation $O(n)$ denotes a function that satisfies
$0 < \limsup_{n\to\infty} O(n)/n < \infty$.

^ This class of models will be used extensively over the book. Although the example 
is self-contained, detailed comments about such models are provided in Note 9.7.1 
and Titterington et al. (1985). 
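$$
\sigma^{-(n+q+1)} \int_{\mathbb{R}^q} \prod_{i=1}^{q}
\varphi\left(\frac{\varepsilon_{-i}}{\sigma}\right)
\varphi\left(\frac{x_0 - \sum_{i=1}^{q} \beta_i \varepsilon_{-i}}{\sigma}\right)
\varphi\left(\frac{x_1 - \beta_1 \hat\varepsilon_0 - \sum_{i=2}^{q} \beta_i \varepsilon_{1-i}}{\sigma}\right)
\cdots
\varphi\left(\frac{x_n - \sum_{i=1}^{q} \beta_i \hat\varepsilon_{n-i}}{\sigma}\right)
d\varepsilon_{-1} \cdots d\varepsilon_{-q},
\tag{1.5}
$$

with

$$
\begin{aligned}
\hat\varepsilon_0 &= x_0 - \sum_{i=1}^{q} \beta_i \varepsilon_{-i}, \\
\hat\varepsilon_1 &= x_1 - \sum_{i=2}^{q} \beta_i \varepsilon_{1-i} - \beta_1 \hat\varepsilon_0, \\
&\ \ \vdots \\
\hat\varepsilon_n &= x_n - \sum_{i=1}^{q} \beta_i \hat\varepsilon_{n-i}.
\end{aligned}
$$

The iterative definition of the $\hat\varepsilon_i$'s is a real obstacle to an
explicit integration in (1.5) and hinders statistical inference in these
models. Note that for $i = -q, -(q-1), \ldots, -1$ the perturbations
$\varepsilon_i$ can be interpreted as missing data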
(see Section 5.3.1). || 
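
The recursion behind the $\hat\varepsilon_i$'s can be sketched in a few lines
of Python. Given the coefficients and a value for the $q$ unobserved initial
perturbations, each residual follows from the observation and the $q$ previous
residuals; this is exactly why the integrand of (1.5) depends on the initial
perturbations in an iterated way. The function names and the storage convention
(an offset of $q$ in the list of perturbations) are assumptions made for this
illustration.

```python
import random


def simulate_ma(n, betas, sigma, rng):
    """Return (x, eps) with x[t] = eps_t + sum_j betas[j-1] * eps_{t-j}, t = 0..n.

    eps is stored with an offset of q: eps[t + q] is eps_t, and eps[:q] holds
    the unobserved initial perturbations eps_{-q}, ..., eps_{-1}.
    """
    q = len(betas)
    eps = [rng.gauss(0.0, sigma) for _ in range(n + 1 + q)]
    x = []
    for t in range(n + 1):
        past = sum(betas[j - 1] * eps[t + q - j] for j in range(1, q + 1))
        x.append(eps[t + q] + past)
    return x, eps


def recover_perturbations(x, betas, initial_eps):
    """Rebuild eps_0, ..., eps_n from x, given values for eps_{-q}, ..., eps_{-1}."""
    q = len(betas)
    eps = list(initial_eps)  # same offset convention as in simulate_ma
    for t, xt in enumerate(x):
        past = sum(betas[j - 1] * eps[t + q - j] for j in range(1, q + 1))
        eps.append(xt - past)
    return eps[q:]
```

When `initial_eps` equals the true initial perturbations, the recursion returns
the true $\varepsilon_t$'s; in practice these values are unknown, which is what
the integral in (1.5) accounts for.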



Before the introduction of simulation-based inference, computational
difficulties encountered in the modeling of a problem often forced the use of
"standard" models and "standard" distributions. One course was to use
models based on exponential families, defined below by (1.9) (see Brown 1986,
Robert 2001, Lehmann and Casella 1998), which enjoy numerous regularity
properties (see Note 1.6.1). Another course was to abandon parametric
representations for nonparametric approaches which are, by definition, robust
against modeling errors.

We also note that the reduction to simple, perhaps non-realistic, distribu- 
tions (necessitated by computational limitations) does not necessarily elimi- 
nate the issue of nonexplicit expressions, whatever the statistical technique. 
Our major focus is the application of simulation-based techniques to provide 
solutions and inference for a more realistic set of models and, hence, circum- 
vent the problems associated with the need for explicit or computationally 
simple answers. 



1.2 Likelihood Methods 

The statistical techniques that we will be most concerned with are maximum 
likelihood and Bayesian methods, and the inferences that can be drawn from 
their use. In their implementation, these approaches are customarily
associated with specific mathematical computations: the former with
maximization problems (and thus with an implicit definition of estimators as
solutions of maximization problems), the latter with integration problems (and
thus with a formally explicit representation of estimators as an integral).
(See Berger 1985, Casella and Berger 2001, Robert 2001, Lehmann and Casella
1998, for an introduction to these techniques.)

The method of maximum likelihood estimation is quite a popular technique 
for deriving estimators. Starting from an iid sample
$\mathbf{x} = (x_1, \ldots, x_n)$ from a population with density
$f(x|\theta_1, \ldots, \theta_k)$, the likelihood function is

$$
L(\theta|\mathbf{x}) = L(\theta_1, \ldots, \theta_k|x_1, \ldots, x_n)
= \prod_{i=1}^{n} f(x_i|\theta_1, \ldots, \theta_k).
\tag{1.6}
$$

More generally, when the $X_i$'s are not iid, the likelihood is defined as the
joint density $f(x_1, \ldots, x_n|\theta)$ taken as a function of $\theta$. The
value of $\theta$, say $\hat\theta$, which is the parameter value at which
$L(\theta|\mathbf{x})$ attains its maximum as a function of $\theta$, with
$\mathbf{x}$ held fixed, is known as a maximum likelihood estimator (MLE).
Notice that, by its construction, the range of the MLE coincides with the 
range of the parameter. The justifications of the maximum likelihood method 
are primarily asymptotic, in the sense that the MLE is converging almost 
surely to the true value of the parameter, under fairly general conditions (see 
Lehmann and Casella 1998) although it can also be interpreted as being at 
the fringe of the Bayesian paradigm (see, e.g., Berger and Wolpert 1988). 



Example 1.4. Gamma MLE. A maximum likelihood estimator is typically
calculated by maximizing the logarithm of the likelihood function (1.6).
Suppose $X_1, \ldots, X_n$ are iid observations from the gamma density

$$
f(x|\alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha-1} e^{-x/\beta},
$$

where we assume that $\alpha$ is known. The log likelihood is

$$
\begin{aligned}
\log L(\alpha, \beta|x_1, \ldots, x_n)
&= \log \prod_{i=1}^{n} f(x_i|\alpha, \beta) \\
&= \log \prod_{i=1}^{n} \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x_i^{\alpha-1} e^{-x_i/\beta} \\
&= -n \log \Gamma(\alpha) - n\alpha \log \beta
+ (\alpha - 1) \sum_{i=1}^{n} \log x_i - \sum_{i=1}^{n} x_i/\beta,
\end{aligned}
$$

where we use the fact that the log of the product is the sum of the logs, and
have done some simplifying algebra. Solving
$\frac{\partial}{\partial\beta} \log L(\alpha, \beta|x_1, \ldots, x_n) = 0$ is
straightforward and yields the MLE of $\beta$,
$\hat\beta = \sum_{i=1}^{n} x_i/(n\alpha)$.

Suppose now that $\alpha$ was also unknown, and we additionally had to solve

$$
\frac{\partial}{\partial\alpha} \log L(\alpha, \beta|x_1, \ldots, x_n) = 0.
$$

This results in a particularly nasty equation, involving some difficult compu- 
tations (such as the derivative of the gamma function, the digamma function). 
An explicit solution is no longer possible. || 
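
Although no closed form exists, the two likelihood equations are easy to solve
numerically. Substituting $\beta = \bar{x}/\alpha$ into the equation in
$\alpha$ gives $\log\alpha - \Psi(\alpha) = \log\bar{x} - n^{-1}\sum_i \log x_i$,
where $\Psi$ is the digamma function and the left side is decreasing in
$\alpha$. The sketch below solves this by bisection. The digamma approximation
(recurrence plus asymptotic series), the bracketing interval $[10^{-3}, 10^{3}]$
and the function names are assumptions of this illustration, not a method
prescribed by the book.

```python
import math
import random


def digamma(x):
    """Digamma for x > 0: shift x above 6 by recurrence, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))


def gamma_mle(xs, tol=1e-10):
    """Return (alpha_hat, beta_hat) for the gamma density with scale beta."""
    n = len(xs)
    mean_x = sum(xs) / n
    # s >= 0 by Jensen's inequality.
    s = math.log(mean_x) - sum(math.log(x) for x in xs) / n
    lo, hi = 1e-3, 1e3  # assumed bracket for alpha
    # log(a) - digamma(a) decreases in a, so bisection applies.
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if math.log(mid) - digamma(mid) > s:
            lo = mid
        else:
            hi = mid
    alpha_hat = 0.5 * (lo + hi)
    # Given alpha, beta_hat is the closed-form solution sum(x) / (n * alpha).
    return alpha_hat, mean_x / alpha_hat
```

A quick check is to simulate with `random.gammavariate(alpha, beta)`, whose
parameters are the shape and the scale, and compare the output of `gamma_mle`
with the values used for the simulation.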

Calculation of maximum likelihood estimators can sometimes be imple- 
mented through the minimization of a sum of squared residuals, which is the 
basis of the method of least squares. 

Example 1.5. Least squares estimators. Estimation by least squares can 
be traced back to Legendre (1805) and Gauss (1810) (see Stigler 1986). In the 
particular case of linear regression, we observe $(x_i, y_i)$,
$i = 1, \ldots, n$, where

$$
Y_i = a x_i + b + \varepsilon_i, \qquad i = 1, \ldots, n,
\tag{1.7}
$$

and the variables $\varepsilon_i$ represent errors. The parameter $(a, b)$ is
estimated by minimizing the distance

$$
\sum_{i=1}^{n} (y_i - a x_i - b)^2
\tag{1.8}
$$

in $(a, b)$, yielding the least squares estimates. If we add more structure to
the error term, in particular that $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$,
independent (equivalently, $Y_i|x_i \sim \mathcal{N}(a x_i + b, \sigma^2)$), the
log-likelihood function for $(a, b)$ is, up to an additive constant,

$$
\log(\sigma^{-n}) - \sum_{i=1}^{n} (y_i - a x_i - b)^2 / 2\sigma^2,
$$

and it follows that the maximum likelihood estimates of $a$ and $b$ are
identical to the least squares estimates.

If, in (1.8), we assume $E(\varepsilon_i) = 0$, or, equivalently, that the
linear relationship $E[Y|x] = ax + b$ holds, minimization of (1.8) is
equivalent, from a computational point of view, to imposing a normality
assumption on $Y$ conditionally on $x$ and applying maximum likelihood. In this
latter case, the additional estimator of $\sigma^2$ is consistent if the normal
approximation is asymptotically valid. (See Gouriéroux and Monfort 1996, for
the related theory of
pseudo-likelihood.) || 
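
For the simple linear model, the minimizer of (1.8) is available in closed
form, and under normal errors it is also the maximum likelihood estimate. The
sketch below follows the parametrization of (1.8), with $a$ the slope and $b$
the intercept, and also returns the MLE of $\sigma^2$ (the residual sum of
squares divided by $n$). The function name is an illustrative choice.

```python
def least_squares_line(xs, ys):
    """Minimize sum (y_i - a*x_i - b)^2 over (a, b); also return the MLE of sigma^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx             # slope
    b = mean_y - a * mean_x   # intercept
    rss = sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = rss / n      # MLE of sigma^2 under normal errors
    return a, b, sigma2_hat
```

The function assumes that the $x_i$'s are not all equal, so that `sxx` is
positive.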

Although somewhat obvious, this formal equivalence between the opti- 
mization of a function depending on the observations and the maximization 
of a likelihood associated with the observations has a nontrivial outcome and 
applies in many other cases. For example, in the case where the parameters 
are constrained, Robertson et al. (1988) consider a $p \times q$ table of random
variables $Y_{ij}$ with means $\theta_{ij}$, where the means are increasing in
$i$ and $j$. Estimation of the $\theta_{ij}$'s by minimizing the sum of the
$(y_{ij} - \theta_{ij})^2$'s is possible through the (numerical) algorithm
called "pool-adjacent-violators," developed
by Robertson et al. (1988) to solve this specific problem. (See Problems 1.18
and 1.19.) An alternative is to use an algorithm based on simulation and a 
representation using a normal likelihood (see Section 5.2.4). 
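
To give a feeling for the pooling idea, here is a sketch of
pool-adjacent-violators in the simpler one-dimensional case, that is, for a
single sequence of means constrained to be nondecreasing. It is a deliberate
simplification: the two-way table considered by Robertson et al. (1988)
requires more than this, and the function name and optional weights are
assumptions of the illustration.

```python
def pool_adjacent_violators(ys, weights=None):
    """Nondecreasing least squares fit to ys (one-dimensional isotonic regression)."""
    if weights is None:
        weights = [1.0] * len(ys)
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for y, w in zip(ys, weights):
        blocks.append([float(y), float(w), 1])
        # Pool the last two blocks while they violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            w12 = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w12, w12, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted
```

Each pooled block is replaced by its weighted average, which is the least
squares solution under the constraint that the block shares a single value.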

In the context of exponential families, that is, distributions with density 

$$
f(x) = h(x)\, e^{\theta \cdot x - \psi(\theta)}, \qquad \theta, x \in \mathbb{R}^k,
\tag{1.9}
$$

the approach by maximum likelihood is (formally) straightforward. The
maximum likelihood estimator of $\theta$ is the solution of

$$
x = \nabla\psi\{\hat\theta(x)\},
\tag{1.10}
$$

which also is the equation yielding a method of moments estimator, since
$E_\theta[X] = \nabla\psi(\theta)$. The function $\psi$ is the log-Laplace
transform, or cumulant generating function, of $h$; that is,
$\psi(t) = \log E[\exp\{t\,h(X)\}]$, where we recognize
the right side as the log moment generating function of $h$.

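Example 1.6. Normal MLE. In the setup of the normal
$\mathcal{N}(\mu, \sigma^2)$ distribution, the density can be written as in
(1.9), since

$$
f(y|\mu, \sigma) \propto \sigma^{-1} e^{-(\mu - y)^2/2\sigma^2}
= \sigma^{-1} e^{-\mu^2/2\sigma^2 + \mu y/\sigma^2 - y^2/2\sigma^2}.
$$

The so-called natural parameters are then $\theta_1 = \mu/\sigma^2$ and
$\theta_2 = -1/2\sigma^2$, with
$\psi(\theta) = -\theta_1^2/4\theta_2 - \log(-2\theta_2)/2$. While there is no
MLE for a single observation from $\mathcal{N}(\mu, \sigma^2)$, equation (1.10)
leads to

$$
\hat\mu = \bar{y}, \qquad \sum_{i=1}^{n} y_i^2 = n(\hat\sigma^2 + \bar{y}^2),
\tag{1.11}
$$

in the case of $n$ iid observations $y_1, \ldots, y_n$, that is, to the regular
MLE, $(\hat\mu, \hat\sigma^2) = (\bar{y}, s^2)$, where
$s^2 = \sum_i (y_i - \bar{y})^2/n$. ||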



Unfortunately, there are many settings where $\psi$ cannot be computed
explicitly. Even if it could be done, it may still be the case that the
solution of (1.10) is not explicit, or there are constraints on $\theta$ such
that the maximum of (1.9) is not a solution of (1.10). This last situation
occurs in the estimation of the table of $\theta_{ij}$'s in the discussion
above.

Example 1.7. Beta MLE. The Beta Be{a,P) distribution is a particular 
case of exponential family since its density. 




1.2 Likelihood Methods 



9 



/(y|«, 0) = r(a)^r{p) ^ ^ i> 

can be written as (1.9), with 6 = {a,/3) and x = (log y,log(l — y)). Equation 
(1.10) becomes 

n lo'i logy ^ O'(a) - !f’(a + /?) , 

^ ’ \og{l-y) = ^{/3) -^a + 0) , 

where $\Psi(z) = d\log\Gamma(z)/dz$ denotes the digamma function (see Abramowitz
and Stegun 1964). There is no explicit solution to (1.12). As in Example 1.6,
although it may seem absurd to estimate both parameters of the $\mathcal{B}e(\alpha, \beta)$
distribution from a single observation, $Y$, the formal computing problem at
the core of this example remains valid for a sample $Y_1, \ldots, Y_n$, since (1.12) is
then replaced by

$\frac{1}{n}\sum_i \log y_i = \Psi(\alpha) - \Psi(\alpha+\beta)$,
$\frac{1}{n}\sum_i \log(1-y_i) = \Psi(\beta) - \Psi(\alpha+\beta)$. ||
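
The sample version of (1.12) has no closed-form solution, but it can be solved
numerically. The following Python sketch is only an illustration and is not a
method taken from the text: it applies a Newton-Raphson iteration to the two
equations, starting from method-of-moments values. The function names
(`digamma`, `trigamma`, `beta_mle`), the series approximations used for the
digamma and trigamma functions, and the simulated $\mathcal{B}e(2, 5)$ sample are all
assumptions made for this example.

```python
import math
import random


def digamma(x):
    """Digamma function: upward recurrence, then an asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return shift + math.log(x) - 0.5 / x - series


def trigamma(x):
    """Derivative of the digamma function, computed the same way."""
    shift = 0.0
    while x < 6.0:
        shift += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    tail = inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return shift + inv + 0.5 * inv2 + tail


def beta_mle(sample, tol=1e-10, max_iter=100):
    """Newton-Raphson solution of the sample version of (1.12)."""
    n = len(sample)
    mean_log_y = sum(math.log(y) for y in sample) / n
    mean_log_1my = sum(math.log(1.0 - y) for y in sample) / n
    # Method-of-moments starting values (an illustrative choice).
    m = sum(sample) / n
    v = sum((y - m) ** 2 for y in sample) / n
    common = m * (1.0 - m) / v - 1.0
    a, b = m * common, (1.0 - m) * common
    for _ in range(max_iter):
        psi_ab = digamma(a + b)
        g1 = mean_log_y - digamma(a) + psi_ab      # first equation residual
        g2 = mean_log_1my - digamma(b) + psi_ab    # second equation residual
        t_ab = trigamma(a + b)
        # Jacobian of (g1, g2) with respect to (a, b); it is symmetric.
        j11, j12, j22 = t_ab - trigamma(a), t_ab, t_ab - trigamma(b)
        det = j11 * j22 - j12 * j12
        da = -(j22 * g1 - j12 * g2) / det
        db = -(j11 * g2 - j12 * g1) / det
        while a + da <= 0.0 or b + db <= 0.0:      # stay inside the parameter space
            da, db = 0.5 * da, 0.5 * db
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


if __name__ == "__main__":
    random.seed(1)
    data = [random.betavariate(2.0, 5.0) for _ in range(2000)]
    print(beta_mle(data))
```

With a sample of this size the estimates should land close to the simulating
values, but the exact figures depend on the random seed.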



When the parameter of interest $\lambda$ is not a one-to-one function of $\theta$, that is,
when there are nuisance parameters, the maximum likelihood estimator of $\lambda$ is
still well defined. If the parameter vector is of the form $\theta = (\lambda, \psi)$, where $\psi$ is a
nuisance parameter, a typical approach is to calculate the full MLE $\hat\theta = (\hat\lambda, \hat\psi)$
and use the resulting $\hat\lambda$ to estimate $\lambda$. In principle, this does not require more
complex calculations, although the distribution of the maximum likelihood
estimator of $\lambda$, $\hat\lambda$, may be quite involved. Many other options exist, such as
conditional, marginal, or profile likelihood (see, for example, Barndorff-Nielsen
and Cox 1994).

Example 1.8. Noncentrality parameter. If $X \sim \mathcal{N}_p(\theta, I_p)$ and if
$\lambda = \|\theta\|^2$ is the parameter of interest, the nuisance parameters are the angles
$\psi$ in the polar representation of $\theta$ and the maximum likelihood estimator of
$\lambda$ is $\hat\lambda(x) = \|x\|^2$, which has a constant bias equal to $p$. Surprisingly, an
observation $Y = \|X\|^2$, which has a noncentral chi squared distribution, $\chi^2_p(\lambda)$
(see Appendix A), leads to a maximum likelihood estimator of $\lambda$ which differs*
from $Y$, since it is the solution of the implicit equation

(1.13) $\sqrt{\lambda}\, I_{(p-2)/2}\left(\sqrt{\lambda y}\right) = \sqrt{y}\, I_{p/2}\left(\sqrt{\lambda y}\right), \quad y > p$,

where $I_\nu$ is the modified Bessel function

$I_\nu(z) = \frac{(z/2)^\nu}{\sqrt{\pi}\,\Gamma(\nu + \frac{1}{2})} \int_0^\pi e^{z\cos\theta} \sin^{2\nu}(\theta)\, d\theta = \left(\frac{z}{2}\right)^\nu \sum_{k=0}^\infty \frac{(z/2)^{2k}}{k!\,\Gamma(\nu + k + 1)}$.

So even in the favorable context of exponential families, we are not necessarily
free from computational problems, since the resolution of (1.13) requires us
first to evaluate the special functions $I_{p/2}$ and $I_{(p-2)/2}$. Note also that the
maximum likelihood estimator is not a solution of (1.13) when $y < p$ (see
Problem 1.20). ||

* This phenomenon is not paradoxical, as $Y = \|X\|^2$ is not a sufficient statistic in
the original problem.



When we leave the exponential family setup, we face increasingly
challenging difficulties in using maximum likelihood techniques. One reason for
this is the lack of a sufficient statistic of fixed dimension outside exponential
families, barring the exception of a few families such as uniform or Pareto
distributions whose support depends on $\theta$ (Robert 2001, Section 3.2). This
result, known as the Pitman-Koopman Lemma (see Lehmann and Casella 1998,
Theorem 1.6.18), implies that, outside exponential families, the complexity of
the likelihood increases quite rapidly with the number of observations, $n$, and,
thus, that its maximization is delicate, even in the simplest cases.



Example 1.9. Student's t distribution. Modeling random perturbations
using normally distributed errors is often (correctly) criticized as being too
restrictive and a reasonable alternative is the Student's t distribution, denoted
by $\mathcal{T}(p, \theta, \sigma)$, which is often more "robust" against possible modeling errors
(and others). The density of $\mathcal{T}(p, \theta, \sigma)$ is proportional to

(1.14) $\sigma^{-1} \left(1 + \frac{(x-\theta)^2}{p\sigma^2}\right)^{-(p+1)/2}$.

Typically, $p$ is known and the parameters $\theta$ and $\sigma$ are unknown. Based on an
iid sample $(X_1, \ldots, X_n)$ from (1.14), the likelihood is proportional to a power
of the product

$\sigma^{2n/(p+1)} \prod_{i=1}^n \left(1 + \frac{(x_i-\theta)^2}{p\sigma^2}\right)$.

When $\sigma$ is known, for some configurations of the sample, this polynomial of
degree $2n$ in $\theta$ may have $n$ local minima, each of which corresponds to a local
maximum of the likelihood and needs to be calculated to determine the global
maximum, the maximum likelihood estimator (see also Problem 1.14). Figure 1.1
illustrates this multiplicity of modes of the likelihood from a Cauchy
distribution $\mathcal{C}(\theta, 1)$ ($p = 1$) when $n = 3$ and $x_1 = 0$, $x_2 = 5$, and $x_3 = 9$. ||
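
The multimodality described in Example 1.9 can be checked on a grid. The
Python sketch below is an illustration only, under the assumption that a
simple grid scan is accurate enough for this three-point sample; the function
names and the grid bounds are hypothetical choices, not taken from the text.

```python
import math


def cauchy_loglik(theta, sample, scale=1.0):
    """Cauchy log-likelihood in theta, up to an additive constant."""
    return -sum(math.log(1.0 + ((x - theta) / scale) ** 2) for x in sample)


def local_maxima(sample, lower, upper, steps=20000):
    """Grid points that are local maxima of the log-likelihood on the grid."""
    grid = [lower + (upper - lower) * i / steps for i in range(steps + 1)]
    values = [cauchy_loglik(t, sample) for t in grid]
    return [
        (grid[i], values[i])
        for i in range(1, steps)
        if values[i] > values[i - 1] and values[i] >= values[i + 1]
    ]


if __name__ == "__main__":
    modes = local_maxima([0.0, 5.0, 9.0], -5.0, 15.0)
    for theta, value in modes:
        print(f"local mode near theta = {theta:.3f}, log-likelihood = {value:.3f}")
    print("global mode near", max(modes, key=lambda m: m[1])[0])
```

For the sample (0, 5, 9) the scan should report one local mode close to each
observation, with the global mode near the middle observation.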

Fig. 1.1. Likelihood of the sample (0, 5, 9) from the distribution $\mathcal{C}(\theta, 1)$.



Example 1.10. (Continuation of Example 1.2) In the special case of a
mixture of two normal distributions,

$p\,\mathcal{N}(\mu, \tau^2) + (1-p)\,\mathcal{N}(\theta, \sigma^2)$,

an iid sample $(X_1, \ldots, X_n)$ results in a likelihood function proportional to

(1.15) $\prod_{i=1}^n \left[ p\,\tau^{-1} \varphi\left(\frac{x_i-\mu}{\tau}\right) + (1-p)\,\sigma^{-1} \varphi\left(\frac{x_i-\theta}{\sigma}\right) \right]$,

where $\varphi$ denotes the standard normal density, containing $2^n$ terms if
expanded. Standard maximization techniques often fail to find the global
maximum because of multimodality of the likelihood function, and specific
algorithms must be devised (to obtain the global maximum with high
probability).

The problem is actually another order of magnitude more difficult, since
the likelihood is unbounded here. The expansion of the product (1.15) contains
the terms

$p^n \tau^{-n} \prod_{i=1}^n \varphi\left(\frac{x_i-\mu}{\tau}\right) + p^{n-1}(1-p)\,\tau^{-n+1}\sigma^{-1}\, \varphi\left(\frac{x_1-\theta}{\sigma}\right) \prod_{i=2}^n \varphi\left(\frac{x_i-\mu}{\tau}\right) + \cdots$.

This expression is unbounded in $\sigma$ (let $\sigma$ go to 0 when $\theta = x_1$). However,
this difficulty with the likelihood function does not preclude us from using
the maximum likelihood approach in this context, since Redner and Walker
(1984) have shown that there exist solutions to the likelihood equations, that
is, local maxima of (1.15), which have acceptable properties. (Similar problems
occur in the context of linear regression with "errors in variables." See Casella
and Berger 2001, Chapter 12, for an introduction.) ||
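
The unboundedness noted in Example 1.10 is easy to observe numerically. The
following Python sketch is a hedged illustration, not part of the original
text: the five data points and the values $p = 0.5$, $\mu = 1$, $\tau = 1$ are
made up, and the function names are hypothetical. It sets $\theta = x_1$ and lets
$\sigma$ shrink, so that the log-likelihood of the mixture keeps increasing.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_loglik(sample, p, mu, tau, theta, sigma):
    """Log-likelihood of p N(mu, tau^2) + (1 - p) N(theta, sigma^2)."""
    return sum(
        math.log(p * normal_pdf(x, mu, tau) + (1.0 - p) * normal_pdf(x, theta, sigma))
        for x in sample
    )


if __name__ == "__main__":
    sample = [-1.2, 0.3, 0.8, 1.9, 2.5]  # made-up data for illustration
    for sigma in (1.0, 1e-1, 1e-2, 1e-3, 1e-6):
        value = mixture_loglik(sample, 0.5, 1.0, 1.0, sample[0], sigma)
        print(f"sigma = {sigma:g}: log-likelihood = {value:.3f}")
```

Each tenfold reduction of $\sigma$ adds roughly $\log 10$ to the log-likelihood once
the second component is concentrated on $x_1$, which is the degenerate behavior
the example warns about.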

In addition to the difficulties associated with optimization problems,
likelihood-related approaches may also face settings where the likelihood function
is only expressible as an integral (for example, the censored data models of
Example 1.1). Similar computational problems arise in the determination of the
power of a testing procedure in the Neyman-Pearson approach (see Lehmann
1986, Casella and Berger 2001, Robert 2001).

For example, inference based on a likelihood ratio statistic requires com- 
putation of quantities such as 

$P_\theta\left(L(\theta|X)/L(\theta_0|X) \le k\right)$,

with fixed $\theta_0$ and $k$, where $L(\theta|x)$ represents the likelihood based on observing
$X = x$. Outside of the more standard (simple) settings, this probability cannot
be explicitly computed because dealing with the distribution of test statistics 
under the alternative hypothesis may be quite difficult. A particularly delicate 
case is the Behrens-Fisher problem, where the above probability is difficult to 
evaluate even under the null hypothesis (see Lehmann 1986, Lee 2004). (Note 
that likelihood ratio tests cannot be rigorously classified as a likelihood-related 
approach, since they violate the Likelihood Principle, see Berger and Wolpert 
1988, but the latter does not provide a testing theory per se.) 



1.3 Bayesian Methods 



Whereas the difficulties related to maximum likelihood methods are mainly 
optimization problems (multiple modes, solution of likelihood equations, links 
between likelihood equations and global modes, etc.), the Bayesian approach 
more often results in integration problems. In the Bayesian paradigm,
information brought by the data $x$, a realization of $X \sim f(x|\theta)$, is combined with
prior information that is specified in a prior distribution with density $\pi(\theta)$
and summarized in a probability distribution, $\pi(\theta|x)$, called the posterior
distribution. This is derived from the joint distribution $f(x|\theta)\pi(\theta)$, according to
Bayes formula

(1.16) $\pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta)\,\pi(\theta)\, d\theta}$,

where $m(x) = \int f(x|\theta)\,\pi(\theta)\, d\theta$ is the marginal density of $X$ (see Berger 1985,
Bernardo and Smith 1994, Robert 2001, for more details, in particular about
the philosophical foundations of this inferential approach).

For the estimation of a particular parameter $h(\theta)$, the decision-theoretic
approach to statistical inference (see, e.g., Berger 1985) requires the
specification of a loss function $L(\delta, \theta)$, which represents the loss incurred by
estimating $h(\theta)$ with $\delta$. The Bayesian version of this approach leads to the
minimization of the Bayes risk,

$\int\!\!\int L(\delta, \theta)\, f(x|\theta)\, \pi(\theta)\, dx\, d\theta$,

that is, the loss integrated against both $X$ and $\theta$. A straightforward inversion
of the order of integration (Fubini's theorem) leads to choosing the estimator
$\delta$ that minimizes (for each $x$) the posterior loss,

(1.17) $E[L(\delta, \theta)|x] = \int L(\delta, \theta)\, \pi(\theta|x)\, d\theta$.

In the particular case of the quadratic loss

$L(\delta, \theta) = \|h(\theta) - \delta\|^2$,

the Bayes estimator of $h(\theta)$ is $\delta^\pi(x) = E^\pi[h(\theta)|x]$. (See Problem 1.22.)

Some of the difficulties related to the computation of $\delta^\pi(x)$ are, first, that
$\pi(\theta|x)$ is not generally available in closed form and, second, that in many
cases the integration of $h(\theta)$ according to $\pi(\theta|x)$ cannot be done analytically.
Loss functions $L(\delta, \theta)$ other than the quadratic loss function are usually even
more difficult to deal with.

The computational drawback of the Bayesian approach has been so great 
that, for a long time, the favored types of priors in Bayesian modeling
were those allowing explicit computations, namely conjugate priors. These 
are prior distributions for which the corresponding posterior distributions are 
themselves members of the original prior family, the Bayesian updating being 
accomplished through updating of parameters. (See Note 1.6.1 and Robert 
2001, Chapter 3, for a discussion of the link between conjugate priors and 
exponential families.) 



Example 1.11. Binomial Bayes estimator. For an observation $X$ from
the binomial distribution $\mathcal{B}(n, p)$, a family of conjugate priors is the family of
Beta distributions $\mathcal{B}e(a, b)$. To find the Bayes estimator of $p$ under squared
error loss, we can find the minimizer of the Bayes risk, that is,

$\min_\delta \int_0^1 \sum_{x=0}^n [p - \delta(x)]^2 \binom{n}{x} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, p^{x+a-1} (1-p)^{n-x+b-1}\, dp$.

Equivalently, we can work with the posterior expected loss (1.17) and find the
estimator that yields

$\min_\delta \frac{\Gamma(a+b+n)}{\Gamma(a+x)\Gamma(n-x+b)} \int_0^1 [p - \delta(x)]^2\, p^{x+a-1} (1-p)^{n-x+b-1}\, dp$,

where we note that the posterior distribution of $p$ (given $x$) is
$\mathcal{B}e(x+a, n-x+b)$. The solution is easily obtained through differentiation, and
the Bayes estimator $\delta^\pi$ is the posterior mean

$\delta^\pi(x) = \frac{\Gamma(a+b+n)}{\Gamma(a+x)\Gamma(n-x+b)} \int_0^1 p\; p^{x+a-1} (1-p)^{n-x+b-1}\, dp = \frac{x+a}{a+b+n}$.

The use of squared error loss results in the Bayes estimator being the mean of
the posterior distribution, which usually simplifies calculations. If, instead, we
had specified an absolute error loss $|p - \delta(x)|$ or had used a nonconjugate prior,
the calculations could have become somewhat more involved (see Problem
1.22). ||
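
The closed-form posterior mean of Example 1.11 can be checked against a direct
numerical integration of the posterior kernel. The Python sketch below is
illustrative only; the function names, the midpoint rule, and the values
$x = 3$, $n = 10$, $a = b = 2$ are assumptions chosen for the example.

```python
def posterior_mean_closed_form(x, n, a, b):
    """Posterior mean of p for a B(n, p) observation and a Be(a, b) prior."""
    return (x + a) / (a + b + n)


def posterior_mean_numeric(x, n, a, b, steps=20000):
    """Midpoint-rule evaluation of the same posterior mean."""
    width = 1.0 / steps
    numerator = 0.0
    denominator = 0.0
    for i in range(steps):
        p = (i + 0.5) * width
        kernel = p ** (x + a - 1) * (1.0 - p) ** (n - x + b - 1)
        numerator += p * kernel      # integrand of the posterior mean
        denominator += kernel        # unnormalized posterior density
    return numerator / denominator


if __name__ == "__main__":
    print(posterior_mean_closed_form(3, 10, 2.0, 2.0))
    print(posterior_mean_numeric(3, 10, 2.0, 2.0))
```

The normalizing constant cancels in the ratio, which is why the numerical
version only needs the kernel $p^{x+a-1}(1-p)^{n-x+b-1}$.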



The use of conjugate priors for computational reasons implies a restriction 
on the modeling of the available prior information and may be detrimental 
to the usefulness of the Bayesian approach as a method of statistical infer- 
ence. This is because it perpetuates an impression both of subjective “ma- 
nipulation” of the background (prior information) and of formal expansions 
unrelated to reality. The considerable advances of Bayesian decision theory 
have often highlighted the negative features of modeling using only conjugate 
priors. For example, Bayes estimators are the optimal estimators for the three 
main classes of optimality (admissibility, minimaxity, invariance), but those 
based on conjugate priors only partially enjoy these properties (see Berger 
1985, Section 4.7, or Robert 2001, Chapter 8).



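Example 1.12. (Continuation of Example 1.8). For the estimation of
$\lambda = \|\theta\|^2$, a reference prior* on $\theta$ is $\pi(\theta) = \|\theta\|^{-(p-1)}$ (see Berger et al. 1998),
with corresponding posterior distribution

(1.18) $\pi(\theta|x) \propto \frac{e^{-\|x-\theta\|^2/2}}{\|\theta\|^{p-1}}$.

The normalizing constant corresponding to $\pi(\theta|x)$ is not easily obtainable and
the Bayes estimator of $\lambda$, the posterior mean

$\frac{\int \|\theta\|^{3-p}\, e^{-\|x-\theta\|^2/2}\, d\theta}{\int \|\theta\|^{1-p}\, e^{-\|x-\theta\|^2/2}\, d\theta}$,

cannot be explicitly computed. (See Problem 1.20.) ||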



* A reference prior is a prior distribution which is derived from maximizing a
distance measure between the prior and the posterior distributions. When there is no
nuisance parameter in the model, the standard reference prior is Jeffreys (1961)
prior (see Bernardo and Smith 1994, Robert 2001, and Note 1.6.1).

The computation of the normalizing constant of $\pi(\theta|x)$ is not just a
formality. Although the derivation of a posterior distribution is generally done
through proportionality relations, that is, using Bayes Theorem in the form

$\pi(\theta|x) \propto \pi(\theta)\, f(x|\theta)$,

it is sometimes necessary to know the posterior distribution or, equivalently,
the marginal distribution, exactly. For example, this is the case in the Bayesian
comparison of (statistical) models. If $\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_k$ are possible models
for the observation $X$, with densities $f_j(\cdot|\theta_j)$, if the associated parameters
$\theta_1, \theta_2, \ldots, \theta_k$ are a priori distributed from $\pi_1, \pi_2, \ldots, \pi_k$, and if these models
have the prior weights $p_1, p_2, \ldots, p_k$, the posterior probability that $X$
originates from model $\mathcal{M}_j$ is (Problem 1.21)

(1.20) $\frac{p_j \int f_j(x|\theta_j)\, \pi_j(\theta_j)\, d\theta_j}{\sum_{i=1}^k p_i \int f_i(x|\theta_i)\, \pi_i(\theta_i)\, d\theta_i}$.

In particular, the comparison of two models $\mathcal{M}_1$ and $\mathcal{M}_2$ is often implemented
through the Bayes factor

$B^\pi = \frac{\int f_1(x|\theta_1)\, \pi_1(\theta_1)\, d\theta_1}{\int f_2(x|\theta_2)\, \pi_2(\theta_2)\, d\theta_2}$,

for which the proportionality constant is quite important (see Kass and
Raftery 1995 and Goutis and Robert 1998 for different perspectives on Bayes
factors). Unsurprisingly, there has been a lot of research in the computation
of these normalizing constants (Gelman and Meng 1998, Kong et al. 2003).



Example 1.13. Logistic regression. A useful regression model for binary
(0-1) responses is the logit model, where the distribution of $Y$ conditional on
the explanatory (or independent) variables $x \in \mathbb{R}^p$ is modeled by the relation

(1.21) $P(Y = 1) = p = \frac{\exp(\alpha + x^t\beta)}{1 + \exp(\alpha + x^t\beta)}$.

Equivalently, the logit transform of $p$, $\mathrm{logit}(p) = \log[p/(1-p)]$, satisfies the
linear relationship $\mathrm{logit}(p) = \alpha + x^t\beta$.

In 1986, the space shuttle Challenger exploded during take off, killing the
seven astronauts aboard. The explosion was the result of an O-ring failure,
a splitting of a ring of rubber that seals the parts of the ship together. The
accident was believed to be caused by the unusually cold weather (31° F, or
about 0° C) at the time of launch, as there is reason to believe that the O-ring
failure probabilities increase as temperature decreases (Dalal et al. 1989).



Flight    14  9 23 10  1  5 13 15  4  3  8 17  2 11  6  7 16 21 19 22 12 20 18
Failure    1  1  1  1  0  0  0  0  0  0  0  0  1  1  0  0  0  1  0  0  0  0  0
Temp.     53 57 58 63 66 67 67 67 68 69 70 70 70 70 72 73 75 75 76 76 78 79 81

Table 1.1. Temperature at flight time (degrees F) and failure of O-rings (1 stands
for failure, 0 for success).



Data on previous space shuttle launches, and O-ring failures, is given in
Table 1.1. It is reasonable to fit a logistic regression, as in (1.21) with $p$ =
probability of an O-ring failure and $x$ = temperature. The results are shown
in Figure 1.2 for an exponential prior on $\log\alpha$ and a flat prior on $\beta$.

Fig. 1.2. The figure shows the result of 10,000 Monte Carlo simulations of the
model (1.21). The left panel shows the average logistic function and variation, the
middle panel shows predictions of failure probabilities at 65° Fahrenheit, and the
right panel shows predictions of failure probabilities at 45° Fahrenheit.

The left panel in Figure 1.2 shows the logistic regression line, and the grey
curves represent the results of a Monte Carlo simulation explained in Example
7.3 from the posterior distribution of the model showing the variability in the
data. It is clear that the ends of the function have little variability, while there
is some in the middle. However, the next two panels are most important, as
they show the variability in failure probability predictions at two different
temperatures. The middle panel, which gives the failure probability at 65°
Fahrenheit, shows that, at this temperature, a failure is just about as likely as
a success. However, at 45° Fahrenheit, the failure probabilities are strongly
skewed toward 1. Given this trend, imagine what the failure probabilities look
like at 31° Fahrenheit, the temperature at Challenger launch time: At that
temperature, failure was almost a certainty.

The logistic regression Monte Carlo analysis of this data is quite straight- 
forward, and gives easy-to-understand answers to the relevant questions. Non- 
Monte Carlo alternatives would typically be based on likelihood theory and 
asymptotics, and would be more difficult to implement and interpret. || 
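
For comparison with the Monte Carlo analysis, the likelihood-based fit of
(1.21) to the data of Table 1.1 can be sketched in a few lines. The Python code
below is an illustration only: it computes the maximum likelihood estimate by
Newton-Raphson, which is not the Bayesian analysis behind Figure 1.2, and the
names `TEMPS`, `FAILS`, `sigmoid`, and `fit_logistic` are hypothetical.
Centering the temperatures is a numerical convenience assumed here.

```python
import math

# Data of Table 1.1: temperature (degrees F) and O-ring failure indicator.
TEMPS = [53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
         70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]
FAILS = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]


def sigmoid(t):
    """Numerically stable logistic function exp(t) / (1 + exp(t))."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(temps, fails, iterations=25):
    """Newton-Raphson MLE of (alpha, beta) in logit(p) = alpha + beta * x."""
    n = len(temps)
    center = sum(temps) / n
    xs = [t - center for t in temps]          # centered covariate
    a, b = 0.0, 0.0                           # intercept at the mean, slope
    for _ in range(iterations):
        probs = [sigmoid(a + b * x) for x in xs]
        # Score vector (gradient of the log-likelihood).
        ga = sum(y - p for y, p in zip(fails, probs))
        gb = sum((y - p) * x for y, p, x in zip(fails, probs, xs))
        # Information matrix (negative Hessian of the log-likelihood).
        w = [p * (1.0 - p) for p in probs]
        haa = sum(w)
        hab = sum(wi * x for wi, x in zip(w, xs))
        hbb = sum(wi * x * x for wi, x in zip(w, xs))
        det = haa * hbb - hab * hab
        a += (hbb * ga - hab * gb) / det
        b += (haa * gb - hab * ga) / det
    return a - b * center, b                  # back to the original scale


if __name__ == "__main__":
    alpha_hat, beta_hat = fit_logistic(TEMPS, FAILS)
    print("alpha =", alpha_hat, "beta =", beta_hat)
    for temp in (65.0, 45.0, 31.0):
        print(temp, sigmoid(alpha_hat + beta_hat * temp))
```

The fitted slope is negative, so the estimated failure probability increases
as the temperature drops, in agreement with the trend described in the example.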

The computational problems encountered in the Bayesian approach are
not limited to the computation of integrals or normalizing constants. For
instance, the determination of confidence regions (also called credible regions)
with highest posterior density,

$C^\pi(x) = \{\theta;\ \pi(\theta|x) \ge k\}$,

requires the solution of the equation $\pi(\theta|x) = k$ for the value of $k$ that satisfies

$P(\theta \in C^\pi(x)|x) = P(\pi(\theta|x) \ge k|x) = \gamma$,

where $\gamma$ is a predetermined confidence level.



Example 1.14. Bayes credible regions. For iid observations $X_1, \ldots, X_n$
from a normal distribution $\mathcal{N}(\theta, \sigma^2)$, and a prior distribution $\theta \sim \mathcal{N}(0, \tau^2)$,
the posterior density of $\theta$ is normal with mean and variance

$\delta^\pi = \frac{n\tau^2}{n\tau^2 + \sigma^2}\, \bar{x}$ and $\frac{\sigma^2\tau^2}{n\tau^2 + \sigma^2}$,

respectively. If we assume that $\sigma^2$ and $\tau^2$ are known, the highest posterior
density region is

$\left\{\theta;\ \sqrt{\frac{n\tau^2+\sigma^2}{2\pi\sigma^2\tau^2}}\, \exp\left[-\frac{n\tau^2+\sigma^2}{2\sigma^2\tau^2}\,(\theta - \delta^\pi)^2\right] \ge k \right\}$.

Since the posterior distribution is symmetric and unimodal, this set is
equivalent to (Problem 1.24)

$\{\theta;\ \delta^\pi - k' \le \theta \le \delta^\pi + k'\}$,

for some constant $k'$ that is chosen to yield a specified posterior probability.
Since the posterior distribution of $\theta$ is normal, this can be done by hand using
a normal probability table.

For the situation of Example 1.11, the posterior distribution of $p$,
$\pi(p|x, a, b)$, was found to be $\mathcal{B}e(x+a, n-x+b)$, which is not necessarily symmetric.
To find the 90% highest posterior density region for $p$, we must find limits
$l(x)$ and $u(x)$ that satisfy

$\int_{l(x)}^{u(x)} \pi(p|x, a, b)\, dp = .9$ and $\pi(l(x)|x, a, b) = \pi(u(x)|x, a, b)$.

This cannot be solved analytically. ||
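
Although the two conditions on $l(x)$ and $u(x)$ have no analytical solution, a
numerical approximation is straightforward. The Python sketch below is an
illustration under stated assumptions: it discretizes $(0, 1)$ and accumulates
grid cells in decreasing order of posterior density until the required mass is
reached, which yields an interval when the Beta posterior is unimodal (both
parameters larger than 1). The function names and the choice
$\mathcal{B}e(4, 8)$, corresponding to $x = 3$, $n = 10$, $a = b = 1$, are hypothetical.

```python
def beta_kernel(p, a, b):
    """Unnormalized Be(a, b) density."""
    return p ** (a - 1.0) * (1.0 - p) ** (b - 1.0)


def hpd_interval(a, b, level=0.9, steps=20000):
    """Grid approximation of the HPD interval of a unimodal Be(a, b) density."""
    width = 1.0 / steps
    dens = [beta_kernel((i + 0.5) * width, a, b) for i in range(steps)]
    total = sum(dens)
    # Visit the cells from the highest density downward.
    order = sorted(range(steps), key=dens.__getitem__, reverse=True)
    mass, low, high = 0.0, steps, -1
    for i in order:
        mass += dens[i] / total
        low, high = min(low, i), max(high, i)
        if mass >= level:
            break
    return low * width, (high + 1) * width


if __name__ == "__main__":
    lower, upper = hpd_interval(4.0, 8.0)
    print("approximate 90% HPD interval:", lower, upper)
    print("densities at the limits:",
          beta_kernel(lower, 4.0, 8.0), beta_kernel(upper, 4.0, 8.0))
```

The two printed densities should be nearly equal, which is the second
condition of the example; the first condition holds up to the grid resolution.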

Computation of a confidence region can be quite delicate when $\pi(\theta|x)$
is not explicit. In particular, when the confidence region involves only one
component of a vector parameter, calculation of $\pi(\theta|x)$ requires the integration
of the joint distribution over all the other parameters. Note that knowledge
of the normalizing factor is of minor importance in this setup. (See Robert
2001, Chapter 6, for other examples.)

Example 1.15. Cauchy confidence regions. Consider $X_1, \ldots, X_n$, an iid
sample from the Cauchy distribution $\mathcal{C}(\theta, \sigma)$, with associated prior distribution
$\pi(\theta, \sigma) = \sigma^{-1}$. The confidence region on $\theta$ is then based on

$\pi(\theta|x_1, \ldots, x_n) \propto \int_0^\infty \sigma^{-n-1} \prod_{i=1}^n \left[1 + \left(\frac{x_i - \theta}{\sigma}\right)^2\right]^{-1} d\sigma$,

an integral which cannot be evaluated explicitly. Similar computational
problems occur with likelihood estimation in this model. One method for obtaining
a likelihood confidence interval for $\theta$ is to use the profile likelihood

$\ell^P(\theta|x_1, \ldots, x_n) = \max_\sigma \ell(\theta, \sigma|x_1, \ldots, x_n)$

and consider the region $\{\theta : \ell^P(\theta|x_1, \ldots, x_n) \ge k\}$. Explicit computation is
also difficult here. ||



Example 1.16. Linear calibration. In a standard regression model,
$Y = \alpha + \beta x + \varepsilon$, there is interest in estimating or predicting features of $Y$ from
knowledge of $x$. In linear calibration models (see Osborne 1991, for an
introduction and review of these models), the interest is in determining values of $x$
from observed responses $y$. For example, in a chemical experiment, one may
want to relate the precise but expensive measure $y$ to the less precise but
inexpensive measure $x$. A simplified version of this problem can be put into
the framework of observing the independent random variables

$Y \sim \mathcal{N}_p(\beta, \sigma^2 I_p), \quad Z \sim \mathcal{N}_p(x_0\beta, \sigma^2 I_p), \quad S \sim \sigma^2\chi^2_q$,

with $x_0 \in \mathbb{R}$, $\beta \in \mathbb{R}^p$. The parameter of interest is now $x_0$ and this problem is
equivalent to Fieller's (1954) problem (see, e.g., Lehmann and Casella 1998).

A reference prior on $(x_0, \beta, \sigma)$ is given in Kubokawa and Robert (1994),
and yields the joint posterior distribution

(1.22) $\pi(x_0, \beta, \sigma^2|y, z, s) \propto \exp\{-(s + \|y - \beta\|^2 + \|z - x_0\beta\|^2)/2\sigma^2\}\, (1 + x_0^2)^{-1/2}$.

This can be analytically integrated to obtain the marginal posterior
distribution of $x_0$ to be

(1.23) $\pi(x_0|y, z, s) \propto \frac{(1 + x_0^2)^{(p+q-1)/2}}{\left\{ \left(x_0 - \frac{y^t z}{s + \|y\|^2}\right)^2 + \frac{\|z\|^2 + s}{\|y\|^2 + s} - \frac{(y^t z)^2}{(s + \|y\|^2)^2} \right\}^{(2p+q)/2}}$.

However, the computation of the posterior mean, the Bayes estimate of $x_0$, is
not feasible analytically; neither is the determination of the confidence region
$\{\pi(x_0|\mathcal{D}) \ge k\}$. Nonetheless, it is desirable to determine this confidence region
since alternative solutions, for example the Fieller-Creasy interval, suffer from
defects such as having infinite length with positive probability (see Gleser and
Hwang 1987, Casella and Berger 2001, Ghosh et al. 1995, Philippe and Robert
1998b). ||



1.4 Deterministic Numerical Methods

The previous examples illustrated the need for techniques, in both the con- 
struction of complex models and estimation of parameters, that go beyond the 
standard analytical approaches. However, before starting to describe simula- 
tion methods, which is the purpose of this book, we should recall that there 
exists a well-developed alternative approach for integration and optimization, 
based on numerical methods. We refer the reader to classical textbooks on
numerical analysis (see, for instance, Fletcher 1980 or Evans 1993) for a
description of these methods, which are generally efficient and can deal with
most of the above examples (see also Lange 1999 or Gentle 2002 for
presentations in statistical settings).



1.4.1 Optimization 



We briefly recall here the more standard approaches to optimization and
integration problems, both for comparison purposes and for future use. When
the goal is to solve an equation of the form $f(x) = 0$, a common approach is
to use a Newton-Raphson algorithm, which produces a sequence $x_n$ such that

(1.24) $x_{n+1} = x_n - \left( \frac{\partial f}{\partial x}\Big|_{x = x_n} \right)^{-1} f(x_n)$

until it stabilizes around a solution of $f(x) = 0$. (Note that $\partial f/\partial x$ is a matrix
in multidimensional settings.) Optimization problems associated with smooth
functions $F$ are then based on this technique, using the equation $\nabla F(x) = 0$,
where $\nabla F$ denotes the gradient of $F$, that is, the vector of derivatives of $F$.
(When the optimization involves a constraint $G(x) = 0$, $F$ is replaced by a
Lagrangian form $F(x) - \lambda G(x)$, where $\lambda$ is used to satisfy the constraint.) The
corresponding techniques are then the gradient methods, where the sequence
$x_n$ is such that

(1.25) $x_{n+1} = x_n - \left(\nabla\nabla^t F\right)^{-1}(x_n)\, \nabla F(x_n)$,

where $\nabla\nabla^t F$ denotes the matrix of second derivatives of $F$.



Example 1.17. A simple Newton-Raphson Algorithm. As a simple
illustration, we show how the Newton-Raphson algorithm can be used to find
the square root of a number. If we are interested in the square root of $b$, this
is equivalent to solving the equation

$f(x) = x^2 - b = 0$.

Applying (1.24) results in the iterations

$x^{(j+1)} = x^{(j)} - \frac{f(x^{(j)})}{f'(x^{(j)})} = x^{(j)} - \frac{x^{(j)2} - b}{2x^{(j)}} = \frac{1}{2}\left(x^{(j)} + \frac{b}{x^{(j)}}\right)$.

Fig. 1.3. Calculation of the root of $f(x) = 0$ and $h'(x) = 0$ for the functions
defined in Example 1.17. The top left panel is $f(x)$, and the top right panel shows
that from different starting points the Newton-Raphson algorithm converges rapidly
to the square root. The bottom left panel is $h(x)$, and the bottom right panel shows
that the Newton-Raphson algorithm cannot find the maximum of $h(x)$, but rather
converges to whatever mode is closest to the starting point.



Figure 1.3 shows that the algorithm converges rapidly to the correct answer
from different starting points (for $b = 2$, three runs are shown, starting at
$x = .5, 2, 4$).

However, if we consider the function

(1.26) $h(x) = [\cos(50x) + \sin(20x)]^2$,

and try to derive its maximum, we run into problems. The "greediness" of the
Newton-Raphson algorithm, that is, the fact that it always goes toward the
nearest mode, does not allow it to escape from local modes. The bottom two
panels in Figure 1.3 show the function, and the convergence of the algorithm
from three different starting points, $x = .25, .379, .75$ (the maximum occurs
at $x = .379$). From the figure we see that wherever we start the algorithm, it
goes to the closest mode, which is often not the maximum. In Chapter 5, we
will compare this solution to those of Examples 5.2 and 5.5 where the (global)
maximum is obtained. ||
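
Both halves of Example 1.17 can be reproduced with a few lines of code. The
Python sketch below is illustrative and makes its own choices: the function
names are hypothetical, the derivatives of (1.26) are coded by hand, and the
iteration counts are arbitrary. The first function implements the square root
recursion; the second applies Newton-Raphson to $h'(x) = 0$ and simply returns
the stationary point it reaches.

```python
import math


def newton_sqrt(b, x0, iterations=20):
    """Newton-Raphson for f(x) = x^2 - b, that is, x <- (x + b / x) / 2."""
    x = x0
    for _ in range(iterations):
        x = 0.5 * (x + b / x)
    return x


def h(x):
    """The function (1.26)."""
    return (math.cos(50.0 * x) + math.sin(20.0 * x)) ** 2


def newton_stationary_point(x0, iterations=50):
    """Newton-Raphson applied to h'(x) = 0, started at x0."""
    x = x0
    for _ in range(iterations):
        g = math.cos(50.0 * x) + math.sin(20.0 * x)
        g1 = -50.0 * math.sin(50.0 * x) + 20.0 * math.cos(20.0 * x)    # g'
        g2 = -2500.0 * math.cos(50.0 * x) - 400.0 * math.sin(20.0 * x)  # g''
        h1 = 2.0 * g * g1                  # h'(x)
        h2 = 2.0 * (g1 * g1 + g * g2)      # h''(x)
        if h2 == 0.0:                      # guard against a zero denominator
            break
        x -= h1 / h2
    return x


if __name__ == "__main__":
    for start in (0.5, 2.0, 4.0):
        print("sqrt(2) from", start, "->", newton_sqrt(2.0, start))
    for start in (0.25, 0.379, 0.75):
        x_end = newton_stationary_point(start)
        print("start", start, "-> x =", x_end, ", h(x) =", h(x_end))
```

Started near $x = .379$, the second routine stays at the global maximum, while
a start at $x = .75$ ends at a nearby local mode with a lower value of $h$.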

Numerous variants of Newton-Raphson-type techniques can be found in
the literature, among which one can mention the steepest descent method,
where each iteration results in a unidimensional optimizing problem for
$F(x_n + t d_n)$ ($t \in \mathbb{R}$), $d_n$ being an acceptable direction, namely such that

$\frac{d^2 F}{dt^2}(x_n + t d_n)\Big|_{t=0}$

is of the proper sign. The direction $d_n$ is often chosen as $\nabla F$ or as the smoothed
version of (1.25),

$\left[\nabla\nabla^t F(x_n) + \lambda I\right]^{-1} \nabla F(x_n)$,

in the Levenberg-Marquardt version. Other versions are available that do not
require differentiation of the function $F$.



1.4.2 Integration 

Turning to integration, the numerical computation of an integral

$\mathfrak{I} = \int_a^b h(x)\, dx$

can be done by simple Riemann integration (see Section 4.3), or by improved
techniques such as the trapezoidal rule

$\hat{\mathfrak{I}} = \frac{1}{2} \sum_{i=1}^{n-1} (x_{i+1} - x_i)\,(h(x_i) + h(x_{i+1}))$,

where the $x_i$'s constitute an ordered partition of $[a, b]$, or yet Simpson's rule,
whose formula is

$\tilde{\mathfrak{I}} = \frac{\delta}{3} \left\{ h(a) + 4 \sum_{i=1}^{n} h(x_{2i-1}) + 2 \sum_{i=1}^{n-1} h(x_{2i}) + h(b) \right\}$

in the case of equally spaced samples $x_0 = a, x_1, \ldots, x_{2n} = b$ with
$(x_{i+1} - x_i) = \delta$. Other approaches involve orthogonal polynomials
(Gram-Charlier, Legendre, etc.), as illustrated by Naylor and Smith (1982) for
statistical problems, or splines (see Wahba 1981, for a statistical connection).
See also Note 2.6.2 for the quasi-Monte Carlo methods that, despite their name,
pertain more to numerical integration than to simulation, since they are totally
deterministic. However, due to the curse of dimensionality, these methods may
not work well in high dimensions, as stressed by Thisted (1988).
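
The two quadrature rules above translate directly into code. The following
Python sketch is a minimal illustration; the function names and the test
integrands are assumptions made for the example, and the Simpson version
assumes $2n$ equal subintervals of $[a, b]$.

```python
import math


def trapezoid(func, points):
    """Trapezoidal rule on an ordered partition given by `points`."""
    return 0.5 * sum(
        (points[i + 1] - points[i]) * (func(points[i]) + func(points[i + 1]))
        for i in range(len(points) - 1)
    )


def simpson(func, a, b, n):
    """Composite Simpson's rule with 2n equal subintervals of [a, b]."""
    delta = (b - a) / (2 * n)
    odd = sum(func(a + (2 * i - 1) * delta) for i in range(1, n + 1))
    even = sum(func(a + 2 * i * delta) for i in range(1, n))
    return delta / 3.0 * (func(a) + 4.0 * odd + 2.0 * even + func(b))


if __name__ == "__main__":
    grid = [i / 1000.0 for i in range(1001)]
    print("trapezoid, exp on [0, 1]:", trapezoid(math.exp, grid))
    print("Simpson,   exp on [0, 1]:", simpson(math.exp, 0.0, 1.0, 10))
    print("exact value:             ", math.e - 1.0)
```

Simpson's rule integrates polynomials of degree at most three exactly, which
gives a convenient sanity check for an implementation.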

1.4.3 Comparison 

Comparison between the approaches, simulation versus numerical analysis,
is delicate because both approaches can provide well-suited tools for many
problems (possibly needing a preliminary study) and the distinction between
these two groups of techniques can be vague. So, rather than addressing the
issue of a general comparison, we will focus on the requirements of each
approach and the objective conditions for their implementation in a statistical
setup.

By nature, standard numerical methods do not take into account the prob- 
abilistic aspects of the problem; that is, the fact that many of the functions 
involved in the computations are related to probability densities. Therefore, a 
numerical integration method may consider regions of a space which have zero 
(or low) probability under the distribution of the model, a phenomenon which 
usually does not appear in a simulation experiment.* Similarly, the occurrence
of local modes of a likelihood will often cause more problems for a determinis- 
tic gradient method than for a simulation method that explores high-density 
regions. (But multimodality must first be identified for these efficient methods 
to apply, as in Oh 1989 or Oh and Berger 1993.) 

On the other hand, simulation methods very rarely take into account the
specific analytical form of the functions involved in the integration or
optimization, while numerical methods often use higher derivatives to provide
bounds on the error of approximation. For instance, because it avoids the
randomness induced by simulation, a gradient method yields a much faster
determination of the mode of a unimodal density. For small dimensions,
integration by Riemann sums or by quadrature converges faster than the mean
of a simulated sample. Moreover, existing scientific software (for instance,
Gauss, Maple, Mathematica, Matlab, R) and scientific libraries like IMSL often
provide highly efficient numerical procedures, whereas simulation is, at
best, implemented through pseudo-random generators for the more common
distributions. However, software like BUGS (see Note 10.6.2) is progressively
bridging the gap.
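
The claim about small dimensions can be illustrated on a one-dimensional
integral. The Python sketch below is only a rough comparison under assumptions
of our own: it integrates the function (1.26) over $[0, 1]$, uses the same
number of function evaluations for Simpson's rule and for a plain Monte Carlo
average, and relies on a closed-form value of the integral obtained from
elementary trigonometric identities. The names `simpson`, `monte_carlo`, and
`EXACT` are hypothetical.

```python
import math
import random


def h(x):
    """The function (1.26)."""
    return (math.cos(50.0 * x) + math.sin(20.0 * x)) ** 2


def simpson(func, a, b, n):
    """Composite Simpson's rule with 2n equal subintervals of [a, b]."""
    delta = (b - a) / (2 * n)
    odd = sum(func(a + (2 * i - 1) * delta) for i in range(1, n + 1))
    even = sum(func(a + 2 * i * delta) for i in range(1, n))
    return delta / 3.0 * (func(a) + 4.0 * odd + 2.0 * even + func(b))


def monte_carlo(func, a, b, n, rng):
    """Average of func over n uniform draws on [a, b], times (b - a)."""
    return (b - a) * sum(func(a + (b - a) * rng.random()) for _ in range(n)) / n


# Integral of h over [0, 1], from cos^2, sin^2 and product-to-sum identities.
EXACT = (1.0 + math.sin(100.0) / 200.0 - math.sin(40.0) / 80.0
         + (1.0 - math.cos(70.0)) / 70.0 - (1.0 - math.cos(30.0)) / 30.0)


if __name__ == "__main__":
    rng = random.Random(0)
    print("exact value      :", EXACT)
    print("Simpson error    :", abs(simpson(h, 0.0, 1.0, 500) - EXACT))
    print("Monte Carlo error:", abs(monte_carlo(h, 0.0, 1.0, 1001, rng) - EXACT))
```

With about a thousand evaluations each, the quadrature error is typically
several orders of magnitude below the Monte Carlo error in this example; the
comparison would look very different in high dimensions.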

Therefore, it is often reasonable to use a numerical approach when dealing 
with regular functions in small dimensions and in a given single problem. On 
the other hand, when the statistician needs to study the details of a likelihood 
surface or posterior distribution, or needs to simultaneously estimate several 
features of these functions, or when the distributions are highly multimodal 
(see Examples 1.2 and 1.8), it is preferable to use a simulation-based approach. 
Such an approach captures (if only approximately through the generated sam- 
ple) the different characteristics of the density and thus allows, at little cost, 
extensions of the inferential scope to, perhaps, another test or estimator. 

However, given the dependence on specific problem characteristics, it is
fruitless to advocate the superiority of one method over the other, say of the
simulation-based approach over numerical methods. Rather, it seems more
reasonable to justify the use of simulation-based methods by the statistician
in terms of expertise. The intuition acquired by a statistician in his or her
everyday processing of random models can be directly exploited in the
implementation of simulation techniques (in particular in the evaluation of the
variation of the proposed estimators or of the stationarity of the resulting
output), while purely numerical techniques rely on less familiar branches of
mathematics. Finally, note that many desirable approaches are those which
efficiently combine both perspectives, as in the case of simulated annealing
(see Section 5.2.3) or Riemann sums (see Section 4.3).

* Simulation methods using a distribution other than the distribution of interest,
such as importance sampling (Section 3.3) or Metropolis-Hastings algorithms
(Chapter 7), may suffer from such a drawback.

1.5 Problems 



1.1 For both the censored data density (1.1) and the mixture of two normal
distributions (1.15), plot the probability density function. Use various values for
the parameters $\mu$, $\theta$, $\sigma$ and $\tau$.

1.2 In the situation of Example 1.1, establish that the densities are indeed (1.1) 
and (1.2). 

1.3 In Example 1.1, the distribution of the random variable $Z = \min(X, Y)$ was
of interest. Derive the distribution of $Z$ in the following case of informative
censoring, where $Y \sim \mathcal{N}(\theta, \sigma^2)$ and $X \sim \mathcal{N}(\theta, \theta^2\sigma^2)$. Pay attention to the
identifiability issues.

1.4 In Example 1.1, show that the integral 

noo 

J (jJ 

can be explicitly calculated. (Hint: Use a change of variables.)

1.5 For the model (1.4), show that the density of $(X_0, \ldots, X_n)$ is given by (1.5).

1.6 In the setup of Example 1.2, derive the moment estimator of the weights
$(p_1, \ldots, p_k)$ when the densities $f_j$ are known.

1.7 In the setup of Example 1.6, show that the likelihood equations are given by
(1.11) and that their solution is the standard $(\bar{y}, s^2)$ statistic.

1.8 (Titterington et al. 1985) In the case of a mixture of two exponential
distributions with parameters 1 and 2,

$\pi\, \mathcal{E}xp(1) + (1 - \pi)\, \mathcal{E}xp(2)$,

show that $E[X^s] = \{\pi + (1 - \pi)2^{-s}\}\Gamma(s + 1)$. Deduce the best (in $s$) moment
estimator based on $t_s(x) = x^s/\Gamma(s + 1)$.

1.9 Give the moment estimator for a mixture of $k$ Poisson distributions, based
on $t_s(x) = x(x - 1) \cdots (x - s + 1)$. (Note: Pearson 1915 and Gumbel 1940
proposed partial solutions in this setup. See Titterington et al. 1985, pp. 80-81,
for details.)

1.10 In the setting of Example 1.9, plot the likelihood based on observing
$(x_1, x_2, x_3) = (0, 5, 9)$ from the Student's t density (1.14) with $p = 1$ and $\sigma = 1$
(which is the standard Cauchy density). Observe the effect on multimodality of
adding a fourth observation $x_4$ when $x_4$ varies.

1.11 The Weibull distribution $\mathcal{W}e(\alpha, c)$ is widely used in engineering
and reliability. Its density is given by

$$f(x|\alpha, c) = c\,\alpha^{-1}(x/\alpha)^{c-1} e^{-(x/\alpha)^c}.$$

(a) Show that when $c$ is known, this model is equivalent to a Gamma model.

(b) Give the likelihood equations in a and c and show that they do not allow 
for explicit solutions. 

(c) Consider an iid sample $X_1, \ldots, X_n$ from $\mathcal{W}e(\alpha, c)$ censored from
the right in $y_0$. Give the corresponding likelihood function when $\alpha$ and $c$
are unknown and show that there is no explicit maximum likelihood estimator
in this case either.

1.12 (Continuation of Problem 1.11) Show that the cdf of the Weibull distribution
$\mathcal{W}e(\alpha, \beta)$ can be written explicitly, and show that the scale parameter
$\alpha$ determines the behavior of the hazard rate $h(t) = f(t)/(1 - F(t))$, where
$f$ and $F$ are the density and the cdf, respectively.

1.13 (Continuation of Problem 1.11) The following sample gives the times (in days) 
at which carcinoma was diagnosed in rats exposed to a carcinogen: 

143, 164, 188, 188, 190, 192, 206, 209, 213, 216, 220, 

227, 230, 234, 246, 265, 304, 216*, 244*, 

where the observations with an asterisk are censored (see Pike 1966, for details). 
Fit a three parameter Weibull $\mathcal{W}e(\alpha, \beta, \gamma)$ distribution to this
dataset, where $\gamma$ is a translation parameter, for (a) $\gamma = 100$ and
$\alpha = 3$, (b) $\gamma = 100$ and $\alpha$ unknown and (c) $\gamma$ and $\alpha$ unknown.
(Note: Treat the asterisked observations as ordinary data here. See Problem 5.24
for a method of dealing with the censoring.)

1.14 Let $X_1, X_2, \ldots, X_n$ be iid with density $f(x|\theta, \sigma)$, the Cauchy
distribution $\mathcal{C}(\theta, \sigma)$, and let
$L(\theta, \sigma|\mathbf{x}) = \prod_{i=1}^n f(x_i|\theta, \sigma)$ be the likelihood function.

(a) If $\sigma$ is known, show that a solution to the likelihood equation
$\frac{d}{d\theta} L(\theta, \sigma|\mathbf{x}) = 0$ is the root of a polynomial of
degree $2n - 1$. Hence, finding the likelihood estimator can be challenging.

(b) For $n = 3$, if both $\theta$ and $\sigma$ are unknown, find the maximum likelihood
estimates and show that they are unique.

(c) For $n > 3$, if both $\theta$ and $\sigma$ are unknown, show that the likelihood is
unimodal.

(Note: See Copas 1975 and Ferguson 1978 for details.)

1.15 Referring to Example 1.16, show that the posterior distribution (1.22) can be 
written in the form (1.23). 

1.16 Consider a Bernoulli random variable $Y \sim \mathcal{B}([1 + e^\theta]^{-1})$.

(a) If $y = 0$, show that the maximum likelihood estimator of $\theta$ is $\infty$.

(b) Show that the same problem occurs when
$Y_1, Y_2 \sim \mathcal{B}([1 + e^\theta]^{-1})$ and $y_1 = y_2 = 0$ or $y_1 = y_2 = 1$.
Give the maximum likelihood estimator in the other cases.

1.17 Consider $n$ observations $x_1, \ldots, x_n$ from $\mathcal{B}(k, p)$ where both $k$ and
$p$ are unknown.

(a) Show that the maximum likelihood estimator of $k$, $\hat{k}$, satisfies

$$(\hat{k}(1 - \hat{p}))^n \ge \prod_{i=1}^n (\hat{k} - x_i) \quad\text{and}\quad
((\hat{k} + 1)(1 - \hat{p}))^n < \prod_{i=1}^n (\hat{k} + 1 - x_i),$$

where $\hat{p}$ is the maximum likelihood estimator of $p$.

(b) If the sample is 16, 18, 22, 25, 27, show that $\hat{k} = 99$.

(c) If the sample is 16, 18, 22, 25, 28, show that $\hat{k} = 190$. Discuss the
stability of the maximum likelihood estimator.







(Note: Olkin et al. 1981 were among the first to investigate the stability of the
MLE for the binomial parameter n; see also Carroll and Lombard 1985, Casella
1986, and Hall 1994.)

1.18 (Robertson et al. 1988) For a sample $x_1, \ldots, x_n$, and a function $f$ on
$\mathcal{X}$, the isotonic regression of $f$ with weights $\omega_i$ is the solution of
the minimization in $g$ of

$$\sum_{i=1}^n \omega_i \,(g(x_i) - f(x_i))^2$$

under the constraint $g(x_1) \le \cdots \le g(x_n)$.

(a) Show that a solution to this problem is obtained by the
pool-adjacent-violators algorithm:

Algorithm A.1 -Pool-adjacent-violators-

If $f$ is not isotonic, find $i$ such that
$f(x_{i-1}) > f(x_i)$, replace $f(x_{i-1})$ and $f(x_i)$ by

$$f^*(x_i) = f^*(x_{i-1}) = \frac{\omega_i f(x_i) + \omega_{i-1} f(x_{i-1})}{\omega_i + \omega_{i-1}}$$

and repeat until the constraint is satisfied. Take $g = f^*$.

(b) Apply this algorithm to the case $n = 4$, $f(x_1) = 23$, $f(x_2) = 27$,
$f(x_3) = 25$, and $f(x_4) = 28$, when the weights are all equal.
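
The pooling step of Algorithm A.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the algorithm as printed: it assumes that pooled points are tracked as blocks carrying a weighted mean and a total weight, and the function name `pool_adjacent_violators` and the toy input are hypothetical choices made for the example.

```python
def pool_adjacent_violators(values, weights=None):
    """Weighted isotonic (non-decreasing) fit obtained by pooling violators."""
    if weights is None:
        weights = [1.0] * len(values)
    # Each block stores [weighted mean, total weight, number of pooled points].
    blocks = []
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Pool while the last two blocks violate the ordering constraint.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / total, total, n1 + n2])
    # Expand the blocks back to one fitted value per original point.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Toy input (not the data of part (b)): the pair (3, 2) violates the ordering.
toy_fit = pool_adjacent_violators([1, 3, 2, 4])
print(toy_fit)  # [1.0, 2.5, 2.5, 4.0]
```

Keeping a stack of blocks means that a newly pooled value is immediately compared with the block before it, which is the "repeat until the constraint is satisfied" step of the algorithm.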

1.19 (Continuation of Problem 1.18) The simple tree ordering is obtained when
one compares treatment effects with a control state. The isotonic regression is
then obtained under the constraint $g(x_i) \ge g(x_1)$ for $i = 2, \ldots, n$.

(a) Show that the following provides the isotonic regression $g^*$:

Algorithm A.2 -Tree ordering-

If $f$ is not isotonic, assume w.l.o.g.
that the $f(x_j)$'s are in increasing order ($j \ge 2$).
Find the smallest $j$ such that

$$A_j = \frac{\omega_1 f(x_1) + \cdots + \omega_j f(x_j)}{\omega_1 + \cdots + \omega_j} < f(x_{j+1}),$$

take $g^*(x_1) = A_j = g^*(x_2) = \cdots = g^*(x_j)$, $g^*(x_{j+1}) = f(x_{j+1})$, $\ldots$

(b) Apply this algorithm to the case where $n = 5$, $f(x_1) = 18$, $f(x_2) = 17$,
$f(x_3) = 12$, $f(x_4) = 21$, $f(x_5) = 16$, with $\omega_1 = \omega_2 = \omega_5 = 1$ and
$\omega_3 = \omega_4 = 3$.



1.20 For the setup of Example 1.8, where $X \sim \mathcal{N}_p(\theta, I_p)$:

(a) Show that the maximum likelihood estimator of $\lambda = \|\theta\|^2$ is
$\hat{\lambda}(x) = \|x\|^2$ and that it has a constant bias equal to $p$.

(b) If we observe $Y = \|X\|^2$, distributed as a noncentral chi squared random
variable (see Appendix A), show that the MLE of $\lambda$ is the solution in (1.13).
Discuss what happens if $y < p$.

(c) In part (a), if the reference prior $\pi(\theta)$ is used, show that the
posterior distribution is given by (1.18), with posterior mean (1.19).

1.21 Show that under the specification $\theta_j \sim \pi_j$ and $X \sim f_j(x|\theta_j)$
for model $\mathcal{M}_j$, the posterior probability of model $\mathcal{M}_j$ is given by (1.20).

1.22 Suppose that $X \sim f(x|\theta)$, with prior distribution $\pi(\theta)$, and
interest is in the estimation of the parameter $h(\theta)$.

(a) Using the loss function $L(\delta, h(\theta))$, show that the estimator that
minimizes the Bayes risk

$$\int\!\!\int L(\delta, h(\theta))\, f(x|\theta)\,\pi(\theta)\, dx\, d\theta$$

is given by the estimator $\delta$ that minimizes (for each $x$)

$$\int L(\delta, h(\theta))\,\pi(\theta|x)\, d\theta .$$

(b) For $L(\delta, \theta) = \|h(\theta) - \delta\|^2$, show that the Bayes estimator of
$h(\theta)$ is $\delta^\pi(x) = E^\pi[h(\theta)|x]$.

(c) For $L(\delta, \theta) = |h(\theta) - \delta|$, show that the Bayes estimator of
$h(\theta)$ is the median of the posterior distribution.

1.23 For each of the following cases, give the posterior and marginal distributions. 

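(a) $X|\sigma \sim \mathcal{N}(0, \sigma^2)$, $1/\sigma^2 \sim \mathcal{G}(1, 2)$;

(b) $X|\lambda \sim \mathcal{P}(\lambda)$, $\lambda \sim \mathcal{G}(2, 1)$;

(c) $X|p \sim \mathcal{N}eg(10, p)$, $p \sim \mathcal{B}e(1/2, 1/2)$.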

1.24 Let $f(x)$ be a unimodal continuous density, and for a given value of $\alpha$,
let the interval $[a, b]$ satisfy $\int_a^b f = 1 - \alpha$.

(a) Show that the shortest interval satisfying the probability constraint is given
by $f(a) = f(b)$, where $a$ and $b$ are on each side of the mode of $f$.

(b) Show that if $f$ is symmetric, then the shortest interval satisfies $a = -b$.

(c) Find the 90% highest posterior credible regions for the posterior
distributions of Problem 1.23.

1.25 (Bauwens 1991) Consider $X_1, \ldots, X_n$ iid $\mathcal{N}(\theta, \sigma^2)$ with prior

$$\pi(\theta, \sigma^2) = \sigma^{-2(\alpha+1)} \exp(-s_0^2/2\sigma^2).$$

(a) Compute the posterior distribution $\pi(\theta, \sigma^2|x_1, \ldots, x_n)$ and show
that it depends only on $\bar{x}$ and $s^2 = \sum_{i=1}^n (x_i - \bar{x})^2$.

(b) Derive the posterior expectation $E^\pi[\sigma^2|x_1, \ldots, x_n]$ and show that
its behavior when $\alpha$ and $s_0$ both converge to 0 depends on the limit of
$(s_0^2/\alpha) - 1$.

1.26 In the setting of Section 1.3, 

(a) Show that, if the prior distribution is improper, the marginal distribution 
is also improper. 

(b) Show that if the prior $\pi(\theta)$ is improper and the sample space $\mathcal{X}$ is
finite, the posterior distribution $\pi(\theta|x)$ is not defined for some value of $x$.

(c) Consider $X_1, \ldots, X_n$ distributed according to $\mathcal{N}(\theta_j, 1)$, with
$\theta_j \sim \mathcal{N}(\mu, \sigma^2)$ ($1 \le j \le n$) and $\pi(\mu, \sigma^2) = \sigma^{-2}$.
Show that the posterior distribution $\pi(\mu, \sigma^2|x_1, \ldots, x_n)$ is not defined.

1.27 Assuming that $\pi(\theta) = 1$ is an acceptable prior for real parameters, show
that this generalized prior leads to $\pi(\sigma) = 1/\sigma$ if $\sigma \in \mathbb{R}^+$ and
to $\pi(\varrho) = 1/\varrho(1 - \varrho)$ if $\varrho \in [0, 1]$ by considering the "natural"
transformations $\theta = \log(\sigma)$ and $\theta = \log(\varrho/(1 - \varrho))$.

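1.28 For each of the following situations, exhibit a conjugate family for the given
distribution:

(a) $X \sim \mathcal{G}(\theta, \beta)$; that is,
$f_\beta(x|\theta) = \beta^\theta x^{\theta-1} e^{-\beta x}/\Gamma(\theta)$.

(b) $X \sim \mathcal{B}e(1, \theta)$, $\theta \in \mathbb{N}$.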

1.29 Show that, if $X \sim \mathcal{B}e(\theta_1, \theta_2)$, there exist conjugate priors on
$\theta = (\theta_1, \theta_2)$ but that they do not lead to tractable posterior
quantities, except for the computation of $E^\pi[\theta_1/(\theta_1 + \theta_2)|x]$.

1.30 Consider $X_1, \ldots, X_n \sim \mathcal{N}(\mu + \nu, \sigma^2)$ with
$\pi(\mu, \nu, \sigma) \propto 1/\sigma$.

(a) Show that the posterior distribution is not defined for every $n$.

(b) Extend this result to overparameterized models with improper priors.

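1.31 Consider estimation in the linear model

$$Y = b_1 X_1 + b_2 X_2 + \epsilon,$$

under the constraint $0 \le b_1, b_2 \le 1$, for a sample $(y_1, x_{11}, x_{21}), \ldots,
(y_n, x_{1n}, x_{2n})$ when the errors $\epsilon_i$ are independent and distributed
according to $\mathcal{N}(0, 1)$. A noninformative prior is

$$\pi(b_1, b_2) = \mathbb{I}_{[0,1]}(b_1)\,\mathbb{I}_{[0,1]}(b_2) .$$

(a) Show that the posterior means are given by ($i = 1, 2$)

$$E^\pi[b_i|y_1, \ldots, y_n] =
\frac{\int_0^1\!\int_0^1 b_i \prod_{j=1}^n \varphi(y_j - b_1 x_{1j} - b_2 x_{2j})\, db_1\, db_2}
     {\int_0^1\!\int_0^1 \prod_{j=1}^n \varphi(y_j - b_1 x_{1j} - b_2 x_{2j})\, db_1\, db_2},$$

where $\varphi$ is the density of the standard normal distribution.

(b) Show that an equivalent expression is

$$\delta_i^\pi(y_1, \ldots, y_n) =
\frac{E^\pi\big[b_i\, \mathbb{I}_{[0,1]^2}(b_1, b_2)\,\big|\, y_1, \ldots, y_n\big]}
     {P^\pi\big[(b_1, b_2) \in [0,1]^2 \,\big|\, y_1, \ldots, y_n\big]},$$

where the right-hand term is computed under the distribution

$$(b_1, b_2)^t \sim \mathcal{N}_2\big((\hat{b}_1, \hat{b}_2)^t, (X^tX)^{-1}\big),$$

with $(\hat{b}_1, \hat{b}_2)$ the unconstrained least squares estimator of $(b_1, b_2)$ and

$$X = \begin{pmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{pmatrix}.$$

(c) Show that the posterior means cannot be written explicitly, except in the
case where $(X^tX)$ is diagonal.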

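1.32 (Berger 1985) Consider the hierarchical model

$$X|\theta \sim \mathcal{N}_p(\theta, \sigma_1^2 I_p),$$
$$\theta|\xi \sim \mathcal{N}_p(\xi\mathbf{1}, \sigma_\pi^2 I_p),$$
$$\xi \sim \mathcal{N}(\xi_0, \tau^2), \qquad \sigma_\pi^2 \sim \pi_2(\sigma_\pi^2),$$

where $\mathbf{1} = (1, \ldots, 1)^t \in \mathbb{R}^p$, and $\sigma_1^2$, $\xi_0$, and $\tau^2$ are fixed.

(a) Show that

$$\delta(x|\xi, \sigma_\pi^2) = x - \frac{\sigma_1^2}{\sigma_1^2 + \sigma_\pi^2}\,(x - \xi\mathbf{1}),$$

$$\pi_2(\xi, \sigma_\pi^2|x) \propto (\sigma_1^2 + \sigma_\pi^2)^{-p/2}
\exp\left\{-\frac{\|x - \xi\mathbf{1}\|^2}{2(\sigma_1^2 + \sigma_\pi^2)}\right\}
e^{-(\xi - \xi_0)^2/2\tau^2}\,\pi_2(\sigma_\pi^2)$$

$$\propto \frac{\pi_2(\sigma_\pi^2)}{(\sigma_1^2 + \sigma_\pi^2)^{p/2}}
\exp\left\{-\frac{p(\bar{x} - \xi)^2 + s^2}{2(\sigma_1^2 + \sigma_\pi^2)}
- \frac{(\xi - \xi_0)^2}{2\tau^2}\right\},$$

with $s^2 = \sum_i (x_i - \bar{x})^2$. Deduce that $\pi_2(\xi|\sigma_\pi^2, x)$ is a normal
distribution. Give the mean and variance of the distribution.

(b) Show that

$$\delta(x|\sigma_\pi^2) = x - \frac{\sigma_1^2}{\sigma_1^2 + \sigma_\pi^2}\,(x - \bar{x}\mathbf{1})
- \frac{\sigma_1^2}{\sigma_1^2 + \sigma_\pi^2 + p\tau^2}\,(\bar{x} - \xi_0)\mathbf{1}$$

and

$$\pi_2(\sigma_\pi^2|x) \propto
\frac{\tau \exp\left\{-\frac{1}{2}\left[\frac{s^2}{\sigma_1^2 + \sigma_\pi^2}
+ \frac{p(\bar{x} - \xi_0)^2}{p\tau^2 + \sigma_1^2 + \sigma_\pi^2}\right]\right\}}
{(\sigma_1^2 + \sigma_\pi^2)^{(p-1)/2}(\sigma_1^2 + \sigma_\pi^2 + p\tau^2)^{1/2}}\;\pi_2(\sigma_\pi^2).$$

(c) Deduce the representation

$$\delta^\pi(x) = x - E^{\pi_2(\sigma_\pi^2|x)}\left[\frac{\sigma_1^2}{\sigma_1^2 + \sigma_\pi^2}\right](x - \bar{x}\mathbf{1})
- E^{\pi_2(\sigma_\pi^2|x)}\left[\frac{\sigma_1^2}{\sigma_1^2 + \sigma_\pi^2 + p\tau^2}\right](\bar{x} - \xi_0)\mathbf{1}$$

and discuss the appeal of this expression from an integration point of view.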

1.33 A classical linear regression can be written as
$Y \sim \mathcal{N}_p(X\beta, \sigma^2 I_p)$ with $X$ a $p \times q$ matrix and
$\beta \in \mathbb{R}^q$.

(a) When $X$ is known, give the natural parameterization of this exponential
family and derive the conjugate priors on $(\beta, \sigma^2)$.

(b) Generalize to $\mathcal{N}_p(X\beta, \Sigma)$.

1.34$^{9}$ An autoregressive model AR(1) connects the random variables in a
sample $X_1, \ldots, X_n$ through the relation $X_{t+1} = \varrho X_t + \epsilon_t$, where
$\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ is independent of $X_t$.

(a) Show that the $X_t$'s induce a Markov chain and derive a stationarity
condition on $\varrho$. Under this condition, what is the stationary distribution
of the chain?

(b) Give the covariance matrix of $(X_1, \ldots, X_n)$.

(c) If $x_0$ is a (fixed) starting value for the chain, express the likelihood
function and derive a conjugate prior on $(\varrho, \sigma^2)$. (Hint: Note that
$X_t|x_{t-1} \sim \mathcal{N}(\varrho x_{t-1}, \sigma^2)$.)

Note: The next four problems involve properties of the exponential family, con- 
jugate prior distributions, and Jeffreys prior distributions. Brown (1986) is a 
book-length introduction to exponential families, and shorter introductions can
be found in Casella and Berger (2001, Section 3.3), Robert (2001, Section 3.2), 
or Lehmann and Casella (1998, Section 1.5). For conjugate and Jeffreys pri- 
ors, in addition to Note 1.6.1, see Berger (1985, Section 3.3) or Robert (2001, 
Sections 3.3 and 3.5). 

1.35 Consider $x = (x_{ij})$ and $\Sigma = (\sigma_{ij})$ symmetric positive-definite
$m \times m$ matrices. The Wishart distribution, $\mathcal{W}_m(\alpha, \Sigma)$, is defined
by the density

$$p_{\alpha,\Sigma}(x) = \frac{|x|^{(\alpha-(m+1))/2} \exp(-\mathrm{tr}(\Sigma^{-1}x)/2)}
{\Gamma_m(\alpha)\,|\Sigma|^{\alpha/2}},$$

with $\mathrm{tr}(A)$ the trace of $A$ and

$$\Gamma_m(\alpha) = 2^{\alpha m/2}\,\pi^{m(m-1)/4} \prod_{i=1}^m \Gamma\left(\frac{\alpha - i + 1}{2}\right).$$

(a) Show that this distribution belongs to the exponential family. Give its
natural representation and derive the mean of $\mathcal{W}_m(\alpha, \Sigma)$.

(b) Show that, if $Z_1, \ldots, Z_n \sim \mathcal{N}_m(0, \Sigma)$,

$$\sum_{i=1}^n Z_i Z_i^t \sim \mathcal{W}_m(n, \Sigma).$$

$^{9}$ Problem 1.34 requires material that will be covered in Chapter 6. It is put
here for those already familiar with Markov chains.



1.36 Consider $X \sim \mathcal{N}(\theta, \theta)$ with $\theta > 0$.

(a) Indicate whether the distribution of $X$ belongs to an exponential family
and derive the conjugate priors on $\theta$.

(b) Determine the Jeffreys prior $\pi^J(\theta)$.

1.37 Show that a Student's $t$ distribution $\mathcal{T}_p(\nu, \theta, \tau^2)$ does not allow
for a conjugate family, apart from $\mathcal{F}_0$, the (trivial) family that contains
all distributions.

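1.38 (Robert 1991) The generalized inverse normal distribution
$\mathcal{IN}(\alpha, \mu, \tau)$ has the density

$$K(\alpha, \mu, \tau)\,|y|^{-\alpha} \exp\left\{-\left(\frac{1}{y} - \mu\right)^2 \Big/ 2\tau^2\right\},$$

with $\alpha > 0$, $\mu \in \mathbb{R}$, and $\tau > 0$.

(a) Show that this density is well defined and that the normalizing factor is

$$K(\alpha, \mu, \tau)^{-1} = \tau^{\alpha-1} e^{-\mu^2/2\tau^2}\, 2^{(\alpha-1)/2}\,
\Gamma\left(\frac{\alpha-1}{2}\right) {}_1F_1\left(\frac{\alpha-1}{2}; 1/2; \frac{\mu^2}{2\tau^2}\right),$$

where ${}_1F_1$ is the confluent hypergeometric function

$${}_1F_1(a; b; z) = \sum_{k=0}^\infty \frac{\Gamma(a + k)\,\Gamma(b)}{\Gamma(b + k)\,\Gamma(a)}\,\frac{z^k}{k!}$$

(see Abramowitz and Stegun 1964).

(b) If $X \sim \mathcal{N}(\mu, \tau^2)$, show that the distribution of $1/X$ is in the
$\mathcal{IN}(\alpha, \mu, \tau)$ family.

(c) Deduce that the mean of $Y \sim \mathcal{IN}(\alpha, \mu, \tau)$ is defined for
$\alpha > 2$ and is

$$E_{\alpha,\mu,\tau}[Y] = \frac{\mu}{\tau^2}\,
\frac{{}_1F_1\left(\frac{\alpha-1}{2}; 3/2; \mu^2/2\tau^2\right)}
     {{}_1F_1\left(\frac{\alpha-1}{2}; 1/2; \mu^2/2\tau^2\right)}.$$

(d) Show that $\theta \sim \mathcal{IN}(\alpha, \mu, \tau)$ constitutes a conjugate family for
the multiplicative model $X \sim \mathcal{N}(\theta, \theta^2)$.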

1.39 Recall the situation of Example 1.8 (see also Example 1.12), where
$X \sim \mathcal{N}_p(\theta, I_p)$.

(a) For the prior $\pi(\lambda) = 1/\sqrt{\lambda}$, show that the Bayes estimator of
$\lambda = \|\theta\|^2$ under quadratic loss can be written as

$$\delta^\pi(x) = \frac{{}_1F_1(3/2; p/2; \|x\|^2/2)}{{}_1F_1(1/2; p/2; \|x\|^2/2)},$$

where ${}_1F_1$ is the confluent hypergeometric function.

(b) Using the series development of ${}_1F_1$ in Problem 1.38, derive an asymptotic
expansion of $\delta^\pi$ (for $\|x\|^2 \to +\infty$) and compare it with
$\delta_0(x) = \|x\|^2 - p$.

(c) Compare the risk behavior of the estimators of part (b) under the weighted
quadratic loss

$$L(\delta, \theta) = \frac{(\|\theta\|^2 - \delta)^2}{2\|\theta\|^2 + p} .$$



1.40 (Smith and Makov 1978) Consider the mixture density

$$X \sim f(x|p) = \sum_{i=1}^k p_i f_i(x),$$

where $p_i > 0$, $\sum_i p_i = 1$, and the densities $f_i$ are known. The prior $\pi(p)$
is a Dirichlet distribution $\mathcal{D}(\alpha_1, \ldots, \alpha_k)$.

(a) Explain why the computing time could get prohibitive as the sample size
increases.

(b) A sequential alternative which approximates the Bayes estimator is to
replace $\pi(p|x_1, \ldots, x_n)$ by $\mathcal{D}(\alpha_1^{(n)}, \ldots, \alpha_k^{(n)})$, with

$$\alpha_1^{(n)} = \alpha_1^{(n-1)} + P(Z_{n1} = 1|x_n), \;\ldots,\;
\alpha_k^{(n)} = \alpha_k^{(n-1)} + P(Z_{nk} = 1|x_n),$$

and $Z_{ni}$ ($1 \le i \le k$) is the component indicator vector of $X_n$. Justify this
approximation and compare with the updating of $\pi(\theta|x_1, \ldots, x_{n-1})$ when
$x_n$ is observed.

(c) Examine the performances of the approximation in part (b) for a mixture
of two normal distributions $\mathcal{N}(0, 1)$ and $\mathcal{N}(2, 1)$ when $p = 0.1$, 0.25,
and 0.5.

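(d) If $\pi_i^n = P(Z_{ni} = 1|x_n)$, show that

$$\hat{p}_i^{(n)}(x_n) = \hat{p}_i^{(n-1)}(x_{n-1}) - a_{n-1}\{\hat{p}_i^{(n-1)}(x_{n-1}) - \pi_i^n\},$$

with $a_{n-1} = 1/(\alpha_0 + n)$ and $\alpha_0 = \sum_i \alpha_i$, where $\hat{p}_i^{(n)}$ is the
quasi-Bayesian approximation of $E^\pi(p_i|x_1, \ldots, x_n)$.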
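
The sequential update of part (b) can be sketched as follows for the two-component normal mixture of part (c). This is only an illustrative reading of the problem: it assumes that $P(Z_{ni} = 1|x_n)$ is evaluated by plugging in the current weight estimate $\alpha_i^{(n-1)}/\sum_j \alpha_j^{(n-1)}$, and the function names, the uniform Dirichlet start, the seed, and the sample size are hypothetical choices.

```python
import math
import random


def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def quasi_bayes_weights(data, means, alpha):
    """Sequentially update Dirichlet parameters, one observation at a time."""
    alpha = list(alpha)
    for x in data:
        total = sum(alpha)
        # Unnormalized P(Z_ni = 1 | x_n) at the current weight estimates.
        post = [a / total * normal_pdf(x, m) for a, m in zip(alpha, means)]
        norm = sum(post)
        # alpha_i^(n) = alpha_i^(n-1) + P(Z_ni = 1 | x_n).
        alpha = [a + q / norm for a, q in zip(alpha, post)]
    total = sum(alpha)
    return [a / total for a in alpha]


rng = random.Random(0)
true_p = 0.25  # weight of the N(0, 1) component in the simulated data
data = [rng.gauss(0.0, 1.0) if rng.random() < true_p else rng.gauss(2.0, 1.0)
        for _ in range(5000)]
estimate = quasi_bayes_weights(data, means=(0.0, 2.0), alpha=(1.0, 1.0))
print(estimate)
```

Each observation costs a fixed amount of work, in contrast with the exact posterior, whose number of terms grows with the sample size (part (a)).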



1.6 Notes 



1.6.1 Prior Distributions 

(i) Conjugate Priors 

When prior information about the model is quite limited, the prior distribution
is often chosen from a parametric family. Families $\mathcal{F}$ that are closed under
sampling (that is, such that, for every prior $\pi \in \mathcal{F}$, the posterior
distribution $\pi(\theta|x)$ also belongs to $\mathcal{F}$) are of particular interest, for
both parsimony and invariance motivations. These families are called conjugate
families. Most often, the main motivation for using conjugate priors is their
tractability; however, such choices may constrain the subjective input.

For reasons related to the Pitman-Koopman Lemma (see the discussion following
Example 1.8), conjugate priors can only be found in exponential families.
In fact, if the sampling density is of the form

(1.27) $f(x|\theta) = C(\theta)h(x) \exp\{R(\theta) \cdot T(x)\}$,

which includes many common continuous and discrete distributions (see Brown
1986), a conjugate family for $f(x|\theta)$ is given by

$$\pi(\theta|\mu, \lambda) = K(\mu, \lambda)\, C(\theta)^\lambda \exp\{R(\theta) \cdot \mu\},$$

since the posterior distribution is $\pi(\theta|\mu + T(x), \lambda + 1)$. Table 1.2
presents some standard conjugate families.



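Distribution   Sample Density                               Prior Density

Normal         $\mathcal{N}(\theta, \sigma^2)$              $\theta \sim \mathcal{N}(\mu, \tau^2)$
Normal         $\mathcal{N}(\mu, 1/\theta)$                 $\theta \sim \mathcal{G}(\alpha, \beta)$
Poisson        $\mathcal{P}(\theta)$                        $\theta \sim \mathcal{G}(\alpha, \beta)$
Gamma          $\mathcal{G}(\nu, \theta)$                   $\theta \sim \mathcal{G}(\alpha, \beta)$
Binomial       $\mathcal{B}(n, \theta)$                     $\theta \sim \mathcal{B}e(\alpha, \beta)$
Multinomial    $\mathcal{M}_k(\theta_1, \ldots, \theta_k)$  $\theta_1, \ldots, \theta_k \sim \mathcal{D}(\alpha_1, \ldots, \alpha_k)$

Table 1.2. Some conjugate families of distributions.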
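
As an illustration of the closure property, the Binomial row of Table 1.2 reduces the posterior computation to an update of the two Beta parameters. The sketch below uses hypothetical function names and arbitrary numbers chosen for the example; it relies only on the standard Beta-Binomial result.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Parameters of the Be posterior on theta after observing x ~ B(n, theta)."""
    return alpha + successes, beta + trials - successes


def beta_mean(alpha, beta):
    """Mean of a Be(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Prior Be(2, 3) and x = 7 successes out of n = 10 trials.
a_post, b_post = beta_binomial_update(2.0, 3.0, successes=7, trials=10)
print(a_post, b_post, beta_mean(a_post, b_post))  # 9.0 6.0 0.6
```

The posterior mean 0.6 is the average of the prior mean 0.4 and the observed frequency 0.7, weighted by 5/15 and 10/15.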



Extensions of (1.27) which allow for parameter-dependent support enjoy most 
properties of the exponential families. In particular, they extend the applicabil- 
ity of conjugate prior analysis to other types of distributions like the uniform 
or the Pareto distribution. (See Robert 1994a, Section 3.2.2.) 

Another justification of conjugate priors, found in Diaconis and Ylvisaker
(1979), is that some Bayes estimators are then linear. If $\xi(\theta) = E_\theta[x]$,
which is equal to $\nabla\psi(\theta)$, the prior mean of $\xi(\theta)$ for the prior
$\pi(\theta|\mu, \lambda)$ with $\mu = x_0$ is $x_0/\lambda$ and if $x_1, \ldots, x_n$ are iid
$f(x|\theta)$,

$$E^\pi[\xi(\theta)|x_1, \ldots, x_n] = \frac{x_0 + n\bar{x}}{\lambda + n} .$$

For more details, see Bernardo and Smith (1994, Section 5.2).

(ii) Noninformative Priors 

If there is no strong prior information, a Bayesian analysis may proceed with a
"noninformative" prior; that is, a prior distribution which attempts to impart
no information about the parameter of interest. A classic noninformative prior
is the Jeffreys prior (Jeffreys 1961). For a sampling density $f(x|\theta)$, this prior
has a density that is proportional to $\sqrt{|I(\theta)|}$, where $|I(\theta)|$ is the
determinant of the Fisher information matrix,

$$I(\theta) = E_\theta\left[\left(\frac{\partial}{\partial\theta} \log f(X|\theta)\right)^2\right]
= -E_\theta\left[\frac{\partial^2}{\partial\theta^2} \log f(X|\theta)\right],$$

equal to the variance of the score vector.
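
The two expressions for the Fisher information can be checked numerically in a simple case. The sketch below takes a Bernoulli observation with success probability $\theta$, a model chosen here for illustration only, and sums over its two outcomes; the function names are hypothetical.

```python
import math


def fisher_information_bernoulli(theta):
    """Return I(theta) computed two ways for X ~ Bernoulli(theta)."""
    score_sq = 0.0  # E[(d/dtheta log f(X|theta))^2]
    neg_hess = 0.0  # -E[d^2/dtheta^2 log f(X|theta)]
    for x, prob in ((0, 1.0 - theta), (1, theta)):
        score = x / theta - (1 - x) / (1.0 - theta)
        hessian = -x / theta ** 2 - (1 - x) / (1.0 - theta) ** 2
        score_sq += prob * score ** 2
        neg_hess -= prob * hessian
    return score_sq, neg_hess


def jeffreys_density_unnormalized(theta):
    """Square root of the Fisher information, up to a constant."""
    return math.sqrt(fisher_information_bernoulli(theta)[0])


info_a, info_b = fisher_information_bernoulli(0.3)
print(info_a, info_b, 1.0 / (0.3 * 0.7))
```

Both expressions agree with $1/\theta(1 - \theta)$, so the Jeffreys prior in this model is proportional to $\theta^{-1/2}(1 - \theta)^{-1/2}$, that is, a $\mathcal{B}e(1/2, 1/2)$ density.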

For further details see Berger (1985), Bernardo and Smith (1994), Robert 
(1994a), or Lehmann and Casella (1998). 

(iii) Reference Priors

An alternative approach to constructing a "noninformative" prior is that of
reference priors (Bernardo 1979, Berger and Bernardo 1992). We start with
Kullback-Leibler information, $K[f, g]$, also known as Kullback-Leibler
information for discrimination between two densities. For densities $f$ and $g$,
it is given by

$$K[f, g] = \int \log\left[\frac{f(t)}{g(t)}\right] f(t)\, dt.$$

The interpretation is that as $K[f, g]$ gets larger, it is easier to discriminate
between the densities $f$ and $g$.
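
A small numerical sketch may help fix ideas. It approximates $K[f, g]$ by a midpoint rule for two unit-variance normal densities whose means differ by a shift; the integration range, the step count, and the function names are assumptions made for the illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def kl_information(f, g, lower, upper, steps=20000):
    """Midpoint-rule approximation of K[f, g] = int log(f/g) f dt."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        t = lower + (i + 0.5) * width
        ft = f(t)
        if ft > 0.0:
            total += math.log(ft / g(t)) * ft * width
    return total


results = {}
for shift in (0.5, 1.0, 2.0):
    results[shift] = kl_information(
        lambda t: normal_pdf(t, 0.0, 1.0),
        lambda t: normal_pdf(t, shift, 1.0),
        -10.0, 10.0)
    print(shift, results[shift])
```

For this pair of densities the integral has the closed form $\text{shift}^2/2$, so the value grows as the two densities move apart, in line with the interpretation above.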

A reference prior can be thought of as the density $\pi(\cdot)$ that maximizes
(asymptotically) the expected Kullback-Leibler information (also known as
Shannon information)

$$\int K[\pi(\theta|x), \pi(\theta)]\, m_\pi(x)\, dx,$$

where $m_\pi(x) = \int f(x|\theta)\pi(\theta)\, d\theta$ is the marginal distribution.

Further details are given in Bernardo and Smith (1994) and Robert (2001,
Section 3.5) and there are approximations due to Clarke and Barron (1990) and
Clarke and Wasserman (1993).



1.6.2 Bootstrap Methods 



Bootstrap (or resampling) techniques are a collection of computationally intensive
methods that are based on resampling from the observed data. They were first
introduced by Efron (1979) and are described more fully in Efron (1982), Efron and
Tibshirani (1994), or Hjorth (1994). (See also Hall 1992 and Barbe and Bertail 1995
for more theoretical treatments.) Although these methods do not call for, in
principle, a simulation-based implementation, in many cases where their use is
particularly important, intensive simulation is required. The basic idea of the
bootstrap$^{10}$ is to evaluate the properties of an arbitrary estimator
$\hat{\theta}(x_1, \ldots, x_n)$ through the empirical cdf of the sample $X_1, \ldots, X_n$,

$$F_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{X_i \le x},$$

instead of the theoretical cdf $F$. More precisely, if an estimate of
$\theta(F) = \int h(x)\,dF(x)$ is desired, an obvious candidate is
$\theta(F_n) = \int h(x)\,dF_n(x)$. When the $X_i$'s are iid, the Glivenko-Cantelli
Theorem (see Billingsley 1995) guarantees the sup-norm convergence of $F_n$ to $F$,
and hence guarantees that $\theta(F_n)$ is a consistent estimator of $\theta(F)$. The
bootstrap provides a somewhat "automatic" method of computing $\theta(F_n)$,
by resampling the data.

It has become common to denote a bootstrap sample with a superscript "$*$", so we
can draw bootstrap samples

$$X^{*i} = (X_1^{*i}, \ldots, X_n^{*i}) \sim F_n, \quad \text{iid}.$$

(Note that the $X_j^{*i}$'s are equal to one of the $x_j$'s and that the same value
$x_j$ can appear several times in $X^{*i}$.) Based on drawing $X^{*1}, \ldots, X^{*B}$,
$\theta(F_n)$ can be approximated by the bootstrap estimator

(1.28) $\hat{\theta}(F_n) \approx \frac{1}{B} \sum_{i=1}^B h(X^{*i})$,

with the approximation becoming more accurate as $B$ increases.

$^{10}$ This name comes from the German novel Adventures of Baron Munchausen by
Rudolph Raspe where the hero saves himself from drowning by pulling on his
own... bootstraps!


If $\hat{\theta}$ is an arbitrary estimator of $\theta(F)$, the bias, the variance, or even
the error distribution, of $\hat{\theta}$,

$$E_F[\hat{\theta} - \theta(F)], \quad \mathrm{var}_F(\hat{\theta}), \quad\text{and}\quad
P_F(\hat{\theta} - \theta(F) \le u),$$

can then be approximated by replacing $F$ with $F_n$. For example, the bootstrap
estimator of the bias will thus be

$$E_{F_n}[\hat{\theta}(F_n^*) - \theta(F_n)] \approx
\frac{1}{B} \sum_{i=1}^B \{\hat{\theta}(F_n^{*i}) - \theta(F_n)\},$$

where $\hat{\theta}(F_n^{*i})$ is constructed as in (1.28), and a confidence interval
$[\hat{\theta} - \beta, \hat{\theta} - \alpha]$ on $\theta$ can be constructed by imposing the
constraint

$$P_{F_n}(\alpha \le \hat{\theta}(F_n^*) - \theta(F_n) \le \beta) = c$$

on $(\alpha, \beta)$, where $c$ is the desired confidence level. There is a huge body of
literature, not directly related to the purpose of this book, in which the authors
establish different optimality properties of the bootstrap estimates in terms of
bias and convergence (see Hall 1992, Efron and Tibshirani 1994, Lehmann 1998).

Although direct computation of $\hat{\theta}$ is possible in some particular cases, most
setups require simulation to approximate the distribution of
$\hat{\theta} - \theta(F_n)$. Indeed, the distribution of $(X_1^*, \ldots, X_n^*)$ has a
discrete support $\{x_1, \ldots, x_n\}^n$, but the cardinality of this support, $n^n$,
increases much too quickly to permit an exhaustive processing of the points of
the support even for samples of average size. There are some algorithms (such as
those based on Gray codes; see Diaconis and Holmes 1994) which may allow
for exhaustive processing in larger samples.
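
The simulation route can be sketched as follows. The example resamples the data with replacement, applies an estimator to each resample, and reads off a bias estimate and an interval of the form $[\hat{\theta} - \beta, \hat{\theta} - \alpha]$. The choices made here are assumptions for illustration: the estimator is the divide-by-$n$ variance, $(\alpha, \beta)$ are taken as the 5% and 95% empirical quantiles (so $c = 0.90$), and the data, seed, and $B = 2000$ are arbitrary.

```python
import random
import statistics


def bootstrap_replicates(data, estimator, n_boot, rng):
    """Evaluate `estimator` on n_boot samples drawn iid from the empirical cdf."""
    n = len(data)
    return [estimator([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]


def plug_in_variance(sample):
    """Divide-by-n variance estimator, which is biased downward."""
    return statistics.pvariance(sample)


rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(30)]
theta_n = plug_in_variance(data)
reps = bootstrap_replicates(data, plug_in_variance, n_boot=2000, rng=rng)

# Bootstrap estimate of the bias: average of theta*(i) - theta(F_n).
bias = statistics.fmean(reps) - theta_n

# Equal-tailed choice of (alpha, beta) from the resampled differences.
diffs = sorted(r - theta_n for r in reps)
alpha = diffs[int(0.05 * len(diffs))]
beta = diffs[int(0.95 * len(diffs)) - 1]
interval = (theta_n - beta, theta_n - alpha)
print(theta_n, bias, interval)
```

With $n = 30$ there are $30^{30}$ possible resamples, so the $B = 2000$ draws stand in for the exhaustive enumeration discussed above.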



$^{11}$ The May 2003 issue of Statistical Science is devoted to the bootstrap, providing
both an introduction and overview of its properties, and an examination of its
influences in various areas of Statistics.



2 Random Variable Generation



“Have you any thought,” resumed Valentin, “of a tool with which it could 
be done?” 

“Speaking within modern probabilities, I really haven’t,” said the doctor. 
— G.K. Chesterton, The Innocence of Father Brown 



The methods developed in this book mostly rely on the possibility of produc- 
ing (with a computer) a supposedly endless flow of random variables (usually 
iid) for well-known distributions. Such a simulation is, in turn, based on the 
production of uniform random variables. Although we are not directly con- 
cerned with the mechanics of producing uniform random variables (see Note 
2.6.1), we are concerned with the statistics of producing uniform and other 
random variables. 

In this chapter we first consider what statistical properties we want a se- 
quence of simulated uniform random variables to have. Then we look at some 
basic methodology that can, starting from these simulated uniform random 
variables, produce random variables from both standard and nonstandard dis- 
tributions. 



2.1 Introduction 

Methods of simulation are based on the production of random variables,
originally independent random variables, that are distributed according to a
distribution $f$ that is not necessarily explicitly known (see, for example,
Examples 1.1, 1.2, and 1.3). The type of random variable production is formalized
below in the definition of a pseudo-random number generator. We first concentrate
on the generation of random variables that are uniform on the interval $[0, 1]$,
because the uniform distribution $\mathcal{U}_{[0,1]}$ provides the basic probabilistic
representation of randomness and also because all other distributions require a
sequence of uniform variables to be simulated.

2.1.1 Uniform Simulation

The logical paradox$^{1}$ associated with the generation of "random numbers" is
the problem of producing a deterministic sequence of values in $[0, 1]$ which
imitates a sequence of iid uniform random variables $\mathcal{U}_{[0,1]}$. (Techniques based
on the physical imitation of a "random draw" using, for example, the internal
clock of the machine have been ruled out. This is because, first, there is no
guarantee on the uniform nature of numbers thus produced and, second, there
is no reproducibility of such samples.) However, we really do not want to enter
here into the philosophical debate on the notion of "random," and whether
it is, indeed, possible to "reproduce randomness" (see, for example, Chaitin
1982, 1988).

For our purposes, there are methods that use a fully deterministic process
to produce a random sequence in the following sense: Having generated
$(X_1, \ldots, X_n)$, knowledge of $X_n$ [or of $(X_1, \ldots, X_n)$] imparts no discernible
knowledge of the value of $X_{n+1}$ if the transformation function is not
available. Of course, given the initial value $X_0$ and the transformation function,
the sample $(X_1, \ldots, X_n)$ is always the same. Thus, the "pseudo-randomness"
produced by these techniques is limited since two samples $(X_1, \ldots, X_n)$ and
$(Y_1, \ldots, Y_n)$ produced by the algorithm will not be independent, nor
identically distributed, nor comparable in any probabilistic sense. This limitation
should not be forgotten: The validity of a random number generator is based
on a single sample $X_1, \ldots, X_n$ when $n$ tends to $+\infty$ and not on replications
$(X_{11}, \ldots, X_{1n}), (X_{21}, \ldots, X_{2n}), \ldots, (X_{k1}, \ldots, X_{kn})$, where $n$ is fixed and
$k$ tends to infinity. In fact, the distribution of these $n$-tuples depends only on
the manner in which the initial values $X_{r1}$ ($1 \le r \le k$) were generated.

With these limitations in mind, we can now introduce the following oper- 
ational definition, which avoids the difficulties of the philosophical distinction 
between a deterministic algorithm and the reproduction of a random phe- 
nomenon. 

Definition 2.1. A uniform pseudo-random number generator is an algorithm
which, starting from an initial value $u_0$ and a transformation $D$, produces a
sequence $(u_i) = (D^i(u_0))$ of values in $[0, 1]$. For all $n$, the values
$(u_1, \ldots, u_n)$ reproduce the behavior of an iid sample $(V_1, \ldots, V_n)$ of uniform
random variables when compared through a usual set of tests.

This definition is clearly restricted to testable aspects of the random variable
generation, which are connected through the deterministic transformation
$u_i = D(u_{i-1})$. Thus, the validity of the algorithm consists in the verification
that the sequence $U_1, \ldots, U_n$ leads to acceptance of the hypothesis

$$H_0 : U_1, \ldots, U_n \text{ are iid } \mathcal{U}_{[0,1]}.$$

$^{1}$ Von Neumann (1951) summarizes this problem very clearly by writing "Any one
who considers arithmetical methods of reproducing random digits is, of course, in
a state of sin. As has been pointed out several times, there is no such thing as
a random number; there are only methods of producing random numbers, and a
strict arithmetic procedure of course is not such a method."

The set of tests used is generally of some consequence. There are classical
tests of uniformity, such as the Kolmogorov-Smirnov test. Many generators
will be deemed adequate under such examination. In addition, and perhaps
more importantly, one can use methods of time series to determine the degree
of correlation between $U_i$ and $(U_{i-1}, \ldots, U_{i-k})$, by using an
ARMA(p, q) model, for instance. One can use nonparametric tests, like those
of Lehmann (1975) or Randles and Wolfe (1979), applying them on arbitrary
decimals of $U_i$. Marsaglia$^{2}$ has assembled a set of tests called Die Hard.
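
Two of the simplest checks mentioned above can be sketched directly. The code computes the Kolmogorov-Smirnov distance between the empirical cdf and the $\mathcal{U}_{[0,1]}$ cdf, and a lag-one empirical correlation as a crude stand-in for the time series analysis. It is applied here to Python's built-in generator, a choice made only for the example; the function names and the seed are hypothetical, and the threshold $1.36/\sqrt{n}$ is the usual asymptotic 5% critical value of the Kolmogorov-Smirnov statistic.

```python
import math
import random


def ks_statistic_uniform(sample):
    """Kolmogorov-Smirnov distance between the empirical cdf and U[0,1]."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def lag_correlation(sample, lag=1):
    """Empirical correlation between u_i and u_{i+lag}."""
    n = len(sample) - lag
    mean = sum(sample) / len(sample)
    var = sum((u - mean) ** 2 for u in sample) / len(sample)
    cov = sum((sample[i] - mean) * (sample[i + lag] - mean)
              for i in range(n)) / n
    return cov / var


rng = random.Random(42)
u = [rng.random() for _ in range(10000)]
d = ks_statistic_uniform(u)
threshold = 1.36 / math.sqrt(len(u))
rho = lag_correlation(u)
print(d, threshold, rho)
```

Passing both checks says little by itself: a sequence can look uniform marginally and uncorrelated at lag one while still being unsuitable, which is why larger batteries of tests are used.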

Definition 2.1 is therefore functional: An algorithm that generates uniform
numbers is acceptable if it is not rejected by a set of tests. This methodology
is not without problems, however. Consider, for example, particular applications
that might demand a large number of iterations, such as the theory of large
deviations (Bucklew 1990), or particle physics, where algorithms resistant to
standard tests may exhibit fatal faults. In particular, flaws such as hidden
periodicities (see below) or nonuniformity in the smaller digits may be
difficult to detect. Ferrenberg et al. (1992) show, for instance, that
an algorithm of Wolff (1989), reputed to be "good," results in systematic biases
in the processing of Ising models (see Example 5.8), due to long-term
correlations in the generated sequence.

The notion that a deterministic system can imitate a random phenomenon
may also suggest the use of chaotic models to create random number generators.
These models, which result in complex deterministic structures (see
Berge et al. 1984, Gleick 1987, Ruelle 1987) are based on dynamic systems of
the form $X_{n+1} = D(X_n)$ which are very sensitive to the initial condition $X_0$.

Example 2.2. The logistic function. The logistic function
$D_\alpha(x) = \alpha x(1 - x)$ produces, for some values of $\alpha \in [3.57, 4.00]$, chaotic
configurations. In particular, the value $\alpha = 4.00$ yields a sequence $(X_n)$ in
$[0, 1]$ that, theoretically, has the same behavior as a sequence of random numbers
(or random variables) distributed according to the arcsine distribution with
density $1/\pi\sqrt{x(1 - x)}$. (See Problem 2.4 for another random number generator
based on the "tent" function.)

Although the limit distribution (also called the stationary distribution)
associated with a dynamic system $X_{n+1} = D(X_n)$ is sometimes defined and
known, the chaotic features of the system do not guarantee acceptable behavior
(in the probabilistic sense) of the associated generator. Figure 2.1 illustrates
the properties of the generator based on the logistic function $D_\alpha$. The
histogram of the transformed variables $Y_n = 0.5 + \arcsin(2X_n - 1)/\pi$, of a
sample of successive values $X_{n+1} = D_\alpha(X_n)$ fits the uniform density extremely
well. Moreover, while the plots of $(Y_n, Y_{n+1})$ and $(Y_n, Y_{n+10})$ do not display
characteristics of uniformity, Figure 2.1 shows that the sample of $(Y_n, Y_{n+100})$
satisfactorily fills the unit square. However, even when these functions give
a good approximation of randomness in the unit square $[0, 1] \times [0, 1]$, the
hypothesis of randomness is rejected by many standard tests.

Fig. 2.1. Plot of the sample $(y_n, y_{n+100})$ ($n = 1, \ldots, 9899$) for the sequence
$x_{n+1} = 4x_n(1 - x_n)$ and $y_n = F(x_n)$, along with the (marginal) histograms of
$y_n$ (on top) and $y_{n+100}$ (right margin).

$^{2}$ These tests are now available as a freeware on the site
http://stat.fsu.edu/~geo/diehard.html

Classic examples from the theory of chaotic functions do not lead to acceptable
pseudo-random number generators. Moreover, the 100 calls to $D_\alpha$
between two generations are excessive in terms of computing time. ||
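
The behavior described in Example 2.2 can be reproduced with a short sketch. It iterates $x_{n+1} = 4x_n(1 - x_n)$, maps each value through the arcsine cdf $F(x) = 1/2 + \arcsin(2x - 1)/\pi$, and measures how closely consecutive transformed values follow the deterministic relation $y_{n+1} = 1 - |2y_n - 1|$, which follows from the substitution $x = \sin^2(\pi y/2)$. The starting value 0.3, the sequence length, and the function names are arbitrary choices for the illustration.

```python
import math


def logistic_sequence(x0, n, a=4.0):
    """Iterate x_{k+1} = a * x_k * (1 - x_k) starting from x0."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(a * xs[-1] * (1.0 - xs[-1]))
    return xs


def to_uniform(x):
    """Arcsine cdf F(x) = 1/2 + arcsin(2x - 1)/pi, clamped against rounding."""
    u = min(1.0, max(-1.0, 2.0 * x - 1.0))
    return 0.5 + math.asin(u) / math.pi


xs = logistic_sequence(0.3, 10000)
ys = [to_uniform(x) for x in xs]

# Largest deviation of (y_n, y_{n+1}) from the tent-shaped curve.
max_gap = max(abs(ys[k + 1] - (1.0 - abs(2.0 * ys[k] - 1.0)))
              for k in range(len(ys) - 1))
print(max_gap)
```

The pairs $(y_n, y_{n+1})$ therefore lie on a curve instead of filling the unit square, which is why only widely spaced pairs such as $(y_n, y_{n+100})$ look uniform in Figure 2.1.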

We have now presented the basic notions needed to understand a very good
pseudo-random number generator, the algorithm Kiss$^{3}$ of Marsaglia and Zaman
(1993). However, many of the details involve notions that are a bit tangential
to the main topic of this text and, in addition, most computer packages now
include a well-behaved uniform random generator. Thus, we leave the details
of the Kiss generator to Note 2.6.1.



2.1.2 The Inverse Transform 

In describing the structure of a space of random variables, it is always possible
to represent the generic probability triple $(\Omega, \mathcal{F}, P)$ (where $\Omega$ represents
the whole space, $\mathcal{F}$ represents a $\sigma$-algebra on $\Omega$, and $P$ is a probability
measure) as $([0, 1], \mathcal{B}, \mathcal{U}_{[0,1]})$ (where $\mathcal{B}$ are the Borel sets on $[0, 1]$)
and therefore equate the variability of $\omega \in \Omega$ with that of a uniform variable
in $[0, 1]$ (see, for instance, Billingsley 1995, Section 2). The random variables $X$
are then functions from $[0, 1]$ to $\mathcal{X}$, that is, functions of uniform variates
transformed by the generalized inverse function.

$^{3}$ The name is an acronym of the saying Keep it simple, stupid!, and not reflective
of more romantic notions. After all, this is a Statistics text!

Definition 2.3. For a non-decreasing function $F$ on $\mathbb{R}$, the
generalized inverse of $F$, $F^-$, is the function defined by

(2.1)    $F^-(u) = \inf\{x : F(x) \ge u\}$ .

We then have the following lemma, sometimes known as the probability integral
transform, which gives us a representation of any random variable as a
transform of a uniform random variable.

Lemma 2.4. If $U \sim \mathcal{U}_{[0,1]}$, then the random variable $F^-(U)$
has the distribution $F$.



Proof. For all $u \in [0,1]$ and for all $x \in F^-([0,1])$, the generalized
inverse satisfies

    $F(F^-(u)) \ge u$    and    $F^-(F(x)) \le x$ .

Therefore,

    $\{(u,x) : F^-(u) \le x\} = \{(u,x) : F(x) \ge u\}$

and

    $P(F^-(U) \le x) = P(U \le F(x)) = F(x)$ .    □

Thus, formally, in order to generate a random variable $X \sim F$, it suffices to
generate $U$ according to $\mathcal{U}_{[0,1]}$ and then make the transformation
$x = F^-(u)$.

Example 2.5. Exponential variable generation. If $X \sim \mathcal{E}xp(1)$, so
$F(x) = 1 - e^{-x}$, then solving for $x$ in $u = 1 - e^{-x}$ gives
$x = -\log(1 - u)$. Therefore, if $U \sim \mathcal{U}_{[0,1]}$, the random
variable $X = -\log U$ has the exponential distribution (as $U$ and $1 - U$ are
both uniform). ||
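The inverse transform of Example 2.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the `rate` argument (the example itself uses rate 1), and the seed are ours, and the uniform generator is the one shipped with the Python standard library rather than Kiss.

```python
import math
import random


def exponential_inverse_transform(n, rate=1.0, seed=0):
    """Draw n Exp(rate) variates via X = -log(U) / rate, with U uniform."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1]
    # and the logarithm is always finite.
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]


sample = exponential_inverse_transform(50_000)
sample_mean = sum(sample) / len(sample)  # should be close to 1 for Exp(1)
```

Using 1 - U rather than U changes nothing in distribution, as noted in the example, but avoids evaluating log(0).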

The generation of uniform random variables is therefore a key determinant 
in the behavior of simulation methods for other probability distributions, since 
those distributions can be represented as a deterministic transformation of 
uniform random variables. (Although, in practice, we often use methods other 
than that of Lemma 2.4, this basic representation is usually a good way to 
think about things. Note also that Lemma 2.4 implies that a bad choice of 
a uniform random number generator can invalidate the resulting simulation 
procedure.) 

As mentioned above, from a theoretical point of view an operational version
of any probability space $(\Omega, \mathcal{F}, P)$ can be created from the
uniform distribution $\mathcal{U}_{[0,1]}$ and Lemma 2.4. Thus, the generation of
any sequence of random variables can be formally implemented through the uniform
generator Kiss. In practice, however, this approach only applies when the
cumulative distribution functions are “explicitly” available, in the sense that
there exists an algorithm allowing the computation of $F^-(u)$ in acceptable
time. In particular, for distributions with explicit forms of $F^-$ (for
instance, the exponential,
double-exponential, or Weibull distributions; see Problem 2.5 for other exam- 
ples), Lemma 2.4 does lead to a practical implementation. But this situation 
only covers a small number of cases, described in Section 2.2 and additional 
problems. Other methods, like the Accept-Reject method of Section 2.3, are 
more general and do not use any strong analytic property of the densities. 
Thus, they can handle more general cases as, for example, the simulation of 
distributions in dimensions greater than one. 



2.1.3 Alternatives 



Although computation by Monte Carlo methods can be thought of as an exact
calculation (as the order of accuracy is only a function of computation time),
it is probably more often thought of as an approximation. Thus, numerical 
approximation is an alternative to Monte Carlo, and should also be consid- 
ered a candidate for solving any particular problem. The following example 
shows how numerical approximations can work in the calculation of normal 
probabilities (see Sections 1.4, 3.6.2 and 3.4 for other approaches). 

Example 2.6. Normal probabilities. Although the cumulative distribution
function $\Phi$ of the normal distribution cannot be expressed explicitly, since

    $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\{-z^2/2\}\, dz$ ,

there exist approximations of $\Phi$ and of $\Phi^{-1}$ up to an arbitrary
precision. For instance, Abramowitz and Stegun (1964) give the approximation

    $\Phi(x) \simeq 1 - \varphi(x)\,[b_1 t + b_2 t^2 + b_3 t^3 + b_4 t^4 + b_5 t^5]$    $(x > 0)$ ,

where $\varphi$ denotes the normal density, $t = (1 + px)^{-1}$ and

    $p = 0.2316419$,  $b_1 = 0.31938$,  $b_2 = -0.35656$,
    $b_3 = 1.78148$,  $b_4 = -1.82125$,  $b_5 = 1.33027$.

Similarly, we also have the approximation

    $\Phi^{-1}(\alpha) \simeq t - \frac{a_0 + a_1 t}{1 + b_1 t + b_2 t^2}$ ,

where $t^2 = \log(\alpha^{-2})$ and

    $a_0 = 2.30753$,  $a_1 = 0.27061$,  $b_1 = 0.99229$,  $b_2 = 0.04481$.

These two approximations are exact up to an error of order $10^{-8}$, the error
being absolute. If no other fast simulation method were available, this
approximation could be used in settings which do not require much precision in
the tails of $\mathcal{N}(0,1)$. (However, as shown in Example 2.8, there exists
an exact and much faster algorithm.) ||
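As a quick check of the first approximation of Example 2.6, the following Python sketch evaluates the polynomial formula with the five-decimal coefficients quoted above and compares it with a reference value built from `math.erf`. The function names and the symmetry rule used for negative arguments are assumptions of this sketch; the original formula is stated for x > 0 only. With coefficients rounded to five decimals, the agreement is naturally a little coarser than with the full-precision constants.

```python
import math

P = 0.2316419
B = (0.31938, -0.35656, 1.78148, -1.82125, 1.33027)


def normal_cdf_approx(x):
    """Polynomial approximation of the standard normal cdf (Example 2.6)."""
    if x < 0:
        # Assumption of this sketch: extend to x < 0 by symmetry.
        return 1.0 - normal_cdf_approx(-x)
    t = 1.0 / (1.0 + P * x)
    poly = sum(b * t ** (k + 1) for k, b in enumerate(B))
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return 1.0 - phi * poly


def normal_cdf_reference(x):
    """Reference value through the error function of the standard library."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


largest_gap = max(
    abs(normal_cdf_approx(x) - normal_cdf_reference(x))
    for x in (0.0, 0.5, 1.0, 1.96, 3.0)
)
```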



2.1.4 Optimal Algorithms 

Devroye (1985) presents a more comprehensive (one could say almost
exhaustive!) treatment of the methods of random variable generation than the one
presented in this chapter, in particular looking at refinements of existing
algorithms in order to achieve uniformly optimal performances. (We strongly
urge the reader to consult this book* for a better insight on the implications
of this goal in terms of probabilistic and algorithmic complexity.)

* The book is now out-of-print but available for free on the author’s website,
at McGill University, Montreal, Canada.

Some refinements of the simulation techniques introduced in this chapter
will be explored in Chapter 4, where we consider ways to accelerate Monte
Carlo methods. At this point, we note that the concepts of “optimal” and
“efficient” algorithms are particularly difficult to formalize. We can naively
compare two algorithms, $[B_1]$ and $[B_2]$ say, in terms of time of
computation, for instance through the average generation time of one
observation. However, such a comparison depends on many subjective factors like
the quality of the programming, the particular programming language used to
implement the method, and the particular machine on which the program runs. More
importantly, it does not take into account the conception and programming (and
debugging) times, nor does it incorporate the specific use of the sample
produced, partly because a quantification of these factors is generally
impossible. For instance, some algorithms have a decreasing efficiency when the
sample size increases. The reduction of the efficiency of a given algorithm to
its average computation time is therefore misleading and we only use this type
of measurement in settings where $[B_1]$ and $[B_2]$ are already of the same
complexity. Devroye (1985) also notes that the simplicity of algorithms should
be accounted for in their evaluation, since complex algorithms facilitate
programming errors and, therefore, may lead to important time losses.**

** In fact, in numerous settings, the time required by a simulation is
overwhelmingly dedicated to programming. This is, at least, the case for the
authors themselves!!!

A last remark to bring this section to its end is that simulation of the
standard distributions presented here is accomplished quite efficiently by many
statistical programming packages (for instance, Gauss, Mathematica, Matlab,
R, Splus). When the generators from these general-purpose packages are easily
accessible (in terms of programming), it is probably preferable to use such
a generator rather than to write one’s own. However, if a generation technique
will get extensive use or if there are particular features of a problem that can
be exploited, the creation of a personal library of random variable generators
can accelerate analyses and even improve results, especially if the setting
involves “extreme” cases (sample size, parameter values, correlation structure,
rare events) for which the usual generators are poorly adapted. The investment
represented by the creation and validation of such a personal library
must therefore be weighed against the potential benefits.



2.2 General Transformation Methods 

When a distribution $f$ is linked in a relatively simple way to another
distribution that is easy to simulate, this relationship can often be exploited
to construct an algorithm to simulate variables from $f$. In this section we
present alternative (to Lemma 2.4) techniques for generating nonuniform
random variables. Some of these methods are rather case-specific, and are difficult
to generalize as they rely on properties of the distribution under consideration 
and its relation with other probability distributions. 

We begin with an illustration of some distributions that are simple to 
generate. 

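Example 2.7. Building on exponential random variables. In Example
2.5 we saw how to generate an exponential random variable starting from a
uniform. Now we illustrate some of the random variables that can be generated
starting from an exponential distribution. If the $X_j$’s are iid
$\mathcal{E}xp(1)$ random variables, then

(2.2)
    $Y = 2 \sum_{j=1}^{\nu} X_j \sim \chi^2_{2\nu}$ ,    $\nu \in \mathbb{N}^*$ ,

    $Y = \beta \sum_{j=1}^{a} X_j \sim \mathcal{G}a(a, \beta)$ ,    $a \in \mathbb{N}^*$ ,

    $Y = \frac{\sum_{j=1}^{a} X_j}{\sum_{j=1}^{a+b} X_j} \sim \mathcal{B}e(a, b)$ ,    $a, b \in \mathbb{N}^*$ .

Other derivations are possible (see Problem 2.6).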
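The three transformations in (2.2) translate directly into code. The sketch below is an illustration only: the function names and the shared seeded generator are assumptions, and the Gamma line follows (2.2) as written, with $\beta$ multiplying the sum of exponentials.

```python
import math
import random

rng = random.Random(1)


def exp1():
    """One Exp(1) draw by the inverse transform of Example 2.5."""
    return -math.log(1.0 - rng.random())


def chi2_even(nu):
    """Chi-squared with 2*nu degrees of freedom: twice a sum of nu Exp(1)."""
    return 2.0 * sum(exp1() for _ in range(nu))


def gamma_int(a, beta=1.0):
    """beta times a sum of a Exp(1) draws, as in the second line of (2.2)."""
    return beta * sum(exp1() for _ in range(a))


def beta_int(a, b):
    """Partial sum over total sum of a + b Exp(1) draws: Be(a, b)."""
    xs = [exp1() for _ in range(a + b)]
    return sum(xs[:a]) / sum(xs)
```

Each call consumes as many uniforms as there are exponentials in the sum, which is one reason why these generators become slow for large integer parameters.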



These transformations are quite simple to use and, hence, will often be a 
favorite. However, there are limits to their usefulness, both in scope of vari- 
ables that can be generated and in efficiency of generation. For example, as 
we will see, there are more efficient algorithms for Gamma and Beta random 
variables. Also, we cannot use exponentials to generate Gamma random
variables with a non-integer shape parameter. For instance, we cannot get a
$\chi^2_1$ variable, which would, in turn, get us a $\mathcal{N}(0,1)$ variable.
For that, we look at the following example of the Box-Muller algorithm (1958)
for the generation of $\mathcal{N}(0,1)$ variables.

Example 2.8. Normal variable generation. If $r$ and $\theta$ are the polar
coordinates of a pair $(X_1, X_2)$ of iid $\mathcal{N}(0,1)$ variables, then,
since the distribution of $(X_1, X_2)$ is rotation invariant (see Problem 2.7),

    $r^2 = X_1^2 + X_2^2 \sim \chi^2_2 = \mathcal{E}xp(1/2)$ ,    $\theta \sim \mathcal{U}_{[0,2\pi]}$ .

If $U_1$ and $U_2$ are iid $\mathcal{U}_{[0,1]}$, the variables $X_1$ and $X_2$
defined by

    $X_1 = \sqrt{-2\log(U_1)}\, \cos(2\pi U_2)$ ,    $X_2 = \sqrt{-2\log(U_1)}\, \sin(2\pi U_2)$ ,

are then iid $\mathcal{N}(0,1)$. The corresponding algorithm is

Algorithm A.3 -Box-Muller-

1. Generate $U_1, U_2$ iid $\mathcal{U}_{[0,1]}$ ;

2. Define                                                        [A.3]
       $x_1 = \sqrt{-2\log(u_1)}\, \cos(2\pi u_2)$ ,
       $x_2 = \sqrt{-2\log(u_1)}\, \sin(2\pi u_2)$ ;

3. Take $x_1$ and $x_2$ as two independent draws from $\mathcal{N}(0,1)$.

In comparison with algorithms based on the Central Limit Theorem, this al- 
gorithm is exact, producing two normal random variables from two uniform 
random variables, the only drawback (in speed) being the necessity of calcu- 
lating functions such as log, cos, and sin. If this is a concern, Devroye (1985) 
gives faster alternatives that avoid the use of these functions (see also Prob- 
lems 2.8 and 2.9). || 
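A direct transcription of [A.3] is given below as a minimal sketch. The function name and the seed are assumptions; the only liberty taken with the algorithm is to use 1 - U for the first uniform so that the logarithm is never evaluated at zero.

```python
import math
import random


def box_muller(rng):
    """Return two independent N(0, 1) draws from two uniform draws ([A.3])."""
    u1 = 1.0 - rng.random()  # in (0, 1], keeps log(u1) finite
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


rng = random.Random(2)
draws = [x for _ in range(25_000) for x in box_muller(rng)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
```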



Example 2.9. Poisson generation. The Poisson distribution is connected
to the exponential distribution through the Poisson process; that is, if
$N \sim \mathcal{P}(\lambda)$ and $X_i \sim \mathcal{E}xp(\lambda)$,
$i \in \mathbb{N}^*$, then

    $P_\lambda(N = k) = P_\lambda(X_1 + \cdots + X_k \le 1 < X_1 + \cdots + X_{k+1})$ .

Thus, the Poisson distribution can be simulated by generating exponential
random variables until their sum exceeds 1. This method is simple, but is really
practical only for smaller values of $\lambda$. On average, the number of
exponential variables required is $\lambda$, and this could be prohibitive for
large values of $\lambda$. In these settings, Devroye (1981) proposed a method
whose computation time is uniformly bounded (in $\lambda$) and we will see
another approach, suitable for large $\lambda$’s, in Example 2.23. Note also
that a generator of Poisson random variables can produce negative binomial
random variables since, when $Y \sim \mathcal{G}a(n, (1-p)/p)$ and
$X | y \sim \mathcal{P}(y)$, $X \sim \mathcal{N}eg(n,p)$. (See Problem 2.13.) ||
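The construction of Example 2.9 can be sketched as a counting loop: exponential inter-arrival times are accumulated until their sum exceeds 1, and the number of arrivals that fit in [0, 1] is returned. The function name and the way the generator is passed in are assumptions of this sketch.

```python
import math
import random


def poisson_from_exponentials(lam, rng):
    """Count the Exp(lam) arrivals that fall in [0, 1]; the count is P(lam)."""
    total, count = 0.0, 0
    while True:
        total += -math.log(1.0 - rng.random()) / lam
        if total > 1.0:
            return count
        count += 1


rng = random.Random(3)
draws = [poisson_from_exponentials(4.0, rng) for _ in range(20_000)]
```

The running time of one call grows linearly with $\lambda$, which is the limitation pointed out in the example.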

Example 2.9 shows a specific algorithm for the generation of Poisson ran- 
dom variables. Based on an application of Lemma 2.4, we can also construct 
a generic algorithm that will work for any discrete distribution. 

Example 2.10. Discrete random variables. To generate $X \sim P_\theta$, we can
calculate (once for all) the probabilities

    $p_0 = P_\theta(X \le 0)$,  $p_1 = P_\theta(X \le 1)$,  $p_2 = P_\theta(X \le 2)$,  ...

then generate $U \sim \mathcal{U}_{[0,1]}$ and take

    $X = k$  if  $p_{k-1} < U < p_k$ .

For example, to generate $X \sim \mathcal{B}in(10, .3)$, the first values are

    $p_0 = 0.028$,  $p_1 = 0.149$,  $p_2 = 0.382$,  ...,  $p_{10} = 1$ ,

and to generate $X \sim \mathcal{P}(7)$, take

    $p_0 = 0.0009$,  $p_1 = 0.0073$,  $p_2 = 0.0296$,  ...

the sequence being stopped when it reaches 1 with a given number of decimals.
(For instance, $p_{20} = 0.999985$.) Specific algorithms, such as that of
Example 2.9, are usually more efficient, mostly because of the storage problem.
See Problem 2.12 and Devroye (1985). ||
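The table-lookup generator of Example 2.10 can be sketched as follows for the binomial case. The function names are assumptions, and the binary search from the standard `bisect` module is our choice for locating the interval containing U; a sequential scan of the table would serve equally well.

```python
import bisect
import math
import random


def binomial_cdf_table(n, p):
    """Cumulative probabilities p_0, ..., p_n of Bin(n, p), computed once."""
    table, acc = [], 0.0
    for k in range(n + 1):
        acc += math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        table.append(acc)
    table[-1] = 1.0  # guard against rounding in the last entry
    return table


def draw_discrete(table, rng):
    """Return the smallest k such that U <= p_k."""
    return bisect.bisect_left(table, rng.random())


table = binomial_cdf_table(10, 0.3)
rng = random.Random(4)
draws = [draw_discrete(table, rng) for _ in range(20_000)]
```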



Example 2.11. Beta generation. Consider $U_1, \ldots, U_n$ an iid sample from
$\mathcal{U}_{[0,1]}$. If $U_{(1)} \le \cdots \le U_{(n)}$ denotes the ordered
sample, that is, the order statistics of the original sample, $U_{(i)}$ is
distributed as $\mathcal{B}e(i, n - i + 1)$ and the vector of the differences
$(U_{(i_1)}, U_{(i_2)} - U_{(i_1)}, \ldots, U_{(i_k)} - U_{(i_{k-1})}, 1 - U_{(i_k)})$
has a Dirichlet distribution $\mathcal{D}(i_1, i_2 - i_1, \ldots, n - i_k + 1)$
(see Problem 2.17). However, even though these probabilistic properties allow
the direct generation of Beta and Dirichlet random variables from uniform random
variables, they do not yield efficient algorithms. The calculation of the order
statistics can, indeed, be quite time-consuming since it requires sorting the
original sample. Moreover, it only applies for integer parameters in the Beta
distribution.

The following result allows for an alternative generation of Beta random
variables from uniform random variables: Jöhnk’s Theorem (see Jöhnk 1964
or Devroye 1985) states that if $U$ and $V$ are iid $\mathcal{U}_{[0,1]}$, the
distribution of

    $\frac{U^{1/\alpha}}{U^{1/\alpha} + V^{1/\beta}}$ ,

conditional on $U^{1/\alpha} + V^{1/\beta} \le 1$, is the
$\mathcal{B}e(\alpha, \beta)$ distribution. However, given the constraint on
$U^{1/\alpha} + V^{1/\beta}$, this result does not provide a good algorithm
to generate $\mathcal{B}e(\alpha, \beta)$ random variables for large values of
$\alpha$ and $\beta$, as shown by the fast decrease of the probability of
accepting a pair $(U, V)$ as a function of $\alpha = \beta$ in Figure 2.2. ||
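Jöhnk’s result translates into a simple rejection loop, sketched below. The function name and the returned trial count are assumptions added for illustration. A short calculation (ours, not taken from the text) gives the acceptance probability as $\Gamma(\alpha+1)\Gamma(\beta+1)/\Gamma(\alpha+\beta+1)$, which is 1/10 for $\alpha = 2$, $\beta = 3$ and drops quickly as the parameters grow.

```python
import random


def johnk_beta(alpha, beta, rng):
    """Be(alpha, beta) draw by Johnk's conditioning; also return the trials."""
    trials = 0
    while True:
        trials += 1
        x = rng.random() ** (1.0 / alpha)
        y = rng.random() ** (1.0 / beta)
        if 0.0 < x + y <= 1.0:
            return x / (x + y), trials


rng = random.Random(5)
results = [johnk_beta(2.0, 3.0, rng) for _ in range(20_000)]
average_trials = sum(t for _, t in results) / len(results)
```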




Fig. 2.2. Probability of accepting a pair $(U, V)$ in Jöhnk (1964) algorithm as a
function of $\alpha$, when $\alpha = \beta$.



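Example 2.12. Gamma generation. Given a generator of Beta random
variables, we can derive a generator of Gamma random variables
$\mathcal{G}a(\alpha, 1)$ ($\alpha < 1$) the following way: If
$Y \sim \mathcal{B}e(\alpha, 1 - \alpha)$ and $Z \sim \mathcal{E}xp(1)$, then
$X = YZ \sim \mathcal{G}a(\alpha, 1)$. Indeed, by making the transformation
$x = yz$, $w = z$ and integrating the joint density, we find

(2.3)    $f(x) = \frac{1}{\Gamma(\alpha)\Gamma(1-\alpha)} \int_x^\infty \left(\frac{x}{w}\right)^{\alpha-1} \left(1 - \frac{x}{w}\right)^{-\alpha} w^{-1} e^{-w}\, dw$

         $= \frac{1}{\Gamma(\alpha)\Gamma(1-\alpha)}\, x^{\alpha-1} \int_x^\infty (w - x)^{-\alpha} e^{-w}\, dw$

         $= \frac{1}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-x}$ .

Alternatively, if we can start with a Gamma random variable, a more efficient
generator for $\mathcal{G}a(\alpha, 1)$ ($\alpha < 1$) can be constructed: If
$Y \sim \mathcal{G}a(\alpha + 1, 1)$ and $U \sim \mathcal{U}_{[0,1]}$,
independent, then $X = Y U^{1/\alpha}$ is distributed according to
$\mathcal{G}a(\alpha, 1)$, since

(2.4)    $f(x) \propto \int_x^\infty w^{\alpha} e^{-w} \left(\frac{x}{w}\right)^{\alpha-1} w^{-1}\, dw = x^{\alpha-1} e^{-x}$ .

(See Stuart 1962 or Problem 2.14).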
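Both representations of Example 2.12 are easy to try out. The sketch below is an illustration under assumptions: the function names are ours, and the Beta and Gamma draws it starts from are taken from Python's standard `random` module (`betavariate`, `gammavariate`) rather than from generators constructed in this chapter.

```python
import random


def gamma_from_beta(alpha, rng):
    """Ga(alpha, 1), alpha < 1, as Y * Z with Y ~ Be(alpha, 1 - alpha), Z ~ Exp(1)."""
    y = rng.betavariate(alpha, 1.0 - alpha)
    z = rng.expovariate(1.0)
    return y * z


def gamma_small_shape(alpha, rng):
    """Ga(alpha, 1), alpha < 1, as Y * U**(1/alpha) with Y ~ Ga(alpha + 1, 1)."""
    y = rng.gammavariate(alpha + 1.0, 1.0)  # arguments are (shape, scale)
    u = 1.0 - rng.random()
    return y * u ** (1.0 / alpha)
```

In both cases the sample mean of a large number of draws should be close to $\alpha$, the mean of the $\mathcal{G}a(\alpha, 1)$ distribution.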



The representation of a probability density as in (2.3) is a particular case 
of a mixture of distributions. Not only does such a representation induce rela- 
tively efficient simulation methods, but it is also related to methods in Chap- 
ters 9 and 10. The principle of a mixture representation is to write a density 
$f$ as the marginal of another distribution, in the form

(2.5)    $f(x) = \int_{\mathcal{Y}} g(x,y)\, dy$    or    $f(x) = \sum_{i \in \mathcal{Y}} p_i\, f_i(x)$ ,

depending on whether $\mathcal{Y}$ is continuous or discrete. For instance, if
the joint distribution $g(x,y)$ is simple to simulate, then the variable $X$ can
be obtained as a component of the generated $(X, Y)$. Alternatively, if the
component distributions $f_i(x)$ can be easily generated, $X$ can be obtained by
first choosing $f_i$ with probability $p_i$ and then generating an observation
from $f_i$.

Example 2.13. Student’s t generation. A useful form of (2.5) is

(2.6)    $f(x) = \int_{\mathcal{Y}} g(x,y)\, dy = \int_{\mathcal{Y}} h_1(x|y)\, h_2(y)\, dy$ ,

where $h_1$ and $h_2$ are the conditional and marginal densities of $X | Y = y$
and $Y$, respectively. For example, we can write Student’s t density with $\nu$
degrees of freedom in this form, where

    $X | y \sim \mathcal{N}(0, \nu/y)$    and    $Y \sim \chi^2_\nu$ .

Such a representation is also useful for discrete distributions. In Example 2.9,
we noted an alternate representation for the negative binomial distribution.
If $X$ is negative binomial, $X \sim \mathcal{N}eg(n,p)$, then $P(X = x)$ can be
written as (2.6) with

    $X | y \sim \mathcal{P}(y)$    and    $Y \sim \mathcal{G}a(n, \beta)$ ,

where $\beta = (1-p)/p$. Note that the discreteness of the negative binomial
distribution does not result in a discrete mixture representation of the
probability. The mixture is continuous, as the distribution of $Y$ is itself
continuous. ||
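The mixture representation of Student’s t in Example 2.13 gives a two-step generator: first the mixing variable, then the conditional normal. The sketch below is an illustration only; the function name is ours, and the chi-squared draw is obtained from the standard library as a Gamma variable with shape $\nu/2$ and scale 2, which is an assumption about the tool rather than part of the example.

```python
import math
import random


def student_t(nu, rng):
    """t_nu draw: Y ~ chi-squared_nu, then X | y ~ N(0, nu / y)."""
    y = rng.gammavariate(nu / 2.0, 2.0)  # chi-squared_nu as Ga(nu/2, scale 2)
    return rng.gauss(0.0, math.sqrt(nu / y))


rng = random.Random(7)
draws = [student_t(10, rng) for _ in range(40_000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)  # theory: nu / (nu - 2)
```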



Example 2.14. Noncentral chi squared generation. The noncentral chi
squared distribution, $\chi^2_p(\lambda)$, also allows for a mixture
representation, since it can be written as a sum of central chi squared
densities. In fact, it is of the form (2.6) with $h_1$ the density of a
$\chi^2_{p+2K}$ distribution and $h_2$ the density of $\mathcal{P}(\lambda/2)$.
However, this representation is not as efficient as the algorithm obtained by
generating $Z \sim \chi^2_{p-1}$ and $Y \sim \mathcal{N}(\sqrt{\lambda}, 1)$,
and using the fact that $Z + Y^2 \sim \chi^2_p(\lambda)$. Note that the
noncentral chi squared distribution does not have an explicit form for its
density function. It is either represented as an infinite mixture (see (3.31))
or by using modified Bessel functions (see Problem 1.8). ||
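The more efficient construction mentioned in Example 2.14 takes two lines of code. In this sketch the function name is an assumption, and the central chi-squared draw again comes from the standard library Gamma generator (shape $(p-1)/2$, scale 2).

```python
import math
import random


def noncentral_chi2(p, lam, rng):
    """chi-squared_p(lam) draw as Z + Y**2, Z ~ chi2_{p-1}, Y ~ N(sqrt(lam), 1)."""
    z = rng.gammavariate((p - 1) / 2.0, 2.0) if p > 1 else 0.0
    y = rng.gauss(math.sqrt(lam), 1.0)
    return z + y * y


rng = random.Random(8)
draws = [noncentral_chi2(4, 3.0, rng) for _ in range(20_000)]
mean = sum(draws) / len(draws)  # theory: p + lam
```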

In addition to the above two examples, other distributions can be repre- 
sented as mixtures (see, for instance, Gleser 1989). In many cases this represen- 
tation can be exploited to produce algorithms for random variable generation 
(see Problems 2.24-2.26, and Note 2.6.3). 



2.3 Accept-Reject Methods

There are many distributions from which it is difficult, or even impossible,
to directly simulate by an inverse transform. Moreover, in some cases, we are
not even able to represent the distribution in a usable form, such as a
transformation or a mixture. In such settings, it is impossible to exploit
direct probabilistic properties to derive a simulation method. We thus turn to
another class of methods that only requires us to know the functional form of
the density $f$ of interest up to a multiplicative constant; no deep analytical
study of $f$ is necessary. The key to this method is to use a simpler
(simulationwise) density $g$ from which the simulation is actually done. For a
given density $g$, called the instrumental density, there are thus many
densities $f$, called the target densities, which can be simulated this way. The
corresponding algorithm, called Accept-Reject, is based on a simple connection
with the uniform distribution, discussed below.

2.3.1 The Fundamental Theorem of Simulation 

There exists a fundamental (simple!) idea that underlies the Accept-Reject
methodology, and also plays a key role in the construction of the slice sampler
(Chapter 8). If $f$ is the density of interest, on an arbitrary space, we can
write

(2.7)    $f(x) = \int_0^{f(x)} du$ .

Thus, $f$ appears as the marginal density (in $X$) of the joint distribution,

(2.8)    $(X, U) \sim \mathcal{U}\{(x,u) : 0 < u < f(x)\}$ .

Since $U$ is not directly related to the original problem, it is called an
auxiliary variable, a notion to be found again in later chapters like
Chapters 8-10.

Although it seems like we have not gained much, the introduction of the
auxiliary uniform variable in (2.7) has brought a considerably different
perspective: Since (2.8) is the joint density of $X$ and $U$, we can generate
from this joint distribution by just generating uniform random variables on the
constrained set $\{(x,u) : 0 < u < f(x)\}$. Moreover, since the marginal
distribution of $X$ is the original target distribution, $f$, by generating a
uniform variable on $\{(x,u) : 0 < u < f(x)\}$, we have generated a random
variable from $f$. And this generation was produced without using $f$ other than
through the calculation of $f(x)$! The importance of this equivalence is
stressed in the following theorem:

Theorem 2.15 (Fundamental Theorem of Simulation). Simulating

    $X \sim f(x)$

is equivalent to simulating

    $(X, U) \sim \mathcal{U}\{(x,u) : 0 < u < f(x)\}$ .

While this theorem is fundamental in many respects, it appears mostly as a
formal representation at this stage because the simulation of the uniform pair
$(X, U)$ is often not straightforward. For example, we could simulate
$X \sim f(x)$ and $U | X = x \sim \mathcal{U}(0, f(x))$, but then this makes the
whole representation useless. And the symmetric approach, which is to simulate
$U$ from its marginal distribution, and then $X$ from the distribution
conditional on $U = u$, does not often result in a feasible calculation. The
solution is to simulate the entire pair $(X, U)$ at once in a bigger set, where
simulation is easier, and then take the pair if the constraint is satisfied.

For example, in a one-dimensional setting, suppose that

    $\int_a^b f(x)\, dx = 1$

and that $f$ is bounded by $m$. We can then simulate the random pair
$(Y, U) \sim \mathcal{U}(0 < u < m)$ by simulating $Y \sim \mathcal{U}(a,b)$ and
$U | Y = y \sim \mathcal{U}(0,m)$, and take the pair only if the further
constraint $0 < u < f(y)$ is satisfied. This results in the correct distribution
of the accepted value of $Y$, call it $X$, because

(2.9)    $P(X \le x) = P(Y \le x \mid U < f(Y)) = \frac{\int_a^x \int_0^{f(y)} du\, dy}{\int_a^b \int_0^{f(y)} du\, dy} = \int_a^x f(y)\, dy$ .

This amounts to saying that, if $A \subset B$ and if we generate a uniform
sample on $B$, keeping only the terms of this sample that are in $A$ will result
in a uniform sample on $A$ (with a random size that is independent of the values
of the sample).



Example 2.16. Beta simulation. We have seen (Example 2.11) that direct
simulation of Beta random variables can be difficult. However, we can easily
use Theorem 2.15 for this simulation when $\alpha > 1$ and $\beta > 1$. Indeed,
to generate $X \sim \mathcal{B}e(\alpha, \beta)$, we take
$Y \sim \mathcal{U}_{[0,1]}$ and $U \sim \mathcal{U}_{[0,m]}$, where $m$ is the
maximum of the Beta density (Problem 2.15). For $\alpha = 2.7$ and
$\beta = 6.3$, Figure 2.3 shows the results of generating 1000 pairs $(Y, U)$.
The pairs that fall under the density function are those for which we accept
$X = Y$, and we reject those pairs that fall outside. ||



In addition, it is easy to see that the probability of acceptance of a given
simulation in the box $[a,b] \times [0,m]$ is given (here with $a = 0$ and
$b = 1$, as in Example 2.16) by

    $P(\text{Accept}) = P(U < f(Y)) = \frac{1}{m} \int_0^1 \int_0^{f(y)} du\, dy = \frac{1}{m}$ .

For Example 2.16, $m = 2.67$, so we accept approximately 1/2.67 = 37% of
the values.
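The box-based simulation of Example 2.16 can be sketched as follows. The function names are assumptions, and the mode formula $(\alpha-1)/(\alpha+\beta-2)$ used to locate the maximum $m$ is a standard fact about the Beta density for $\alpha, \beta > 1$ that the text leaves to Problem 2.15.

```python
import math
import random


def beta_density(x, a, b):
    """Density of Be(a, b) at x."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)


def beta_box_sampler(a, b, n_pairs, rng):
    """Simulate n_pairs points (Y, U) in [0,1] x [0,m]; keep Y when U < f(Y)."""
    mode = (a - 1.0) / (a + b - 2.0)  # valid for a > 1 and b > 1
    m = beta_density(mode, a, b)
    accepted = []
    for _ in range(n_pairs):
        y = rng.random()
        u = m * rng.random()
        if u < beta_density(y, a, b):
            accepted.append(y)
    return accepted, m


rng = random.Random(10)
accepted, m = beta_box_sampler(2.7, 6.3, 20_000, rng)
acceptance_rate = len(accepted) / 20_000  # theory: 1 / m
```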

Fig. 2.3. Generation of Beta random variables: Using Theorem 2.15, 1000 $(Y, U)$
pairs were generated, and 365 were accepted (the black circles under the Beta
$\mathcal{B}e(2.7, 6.3)$ density function).



The argument leading to (2.9) can easily be generalized to the situation
where the larger set is not a box any longer, as long as simulating uniformly
over this larger set is feasible. This generalization may then allow for cases
where either or both of the support of $f$ and the maximum of $f$ are unbounded.
If the larger set is of the form

    $\mathcal{L} = \{(y,u) : 0 < u < m(y)\}$ ,

the constraints are thus that $m(x) \ge f(x)$ and that simulation of a uniform
on $\mathcal{L}$ is feasible. Obviously, efficiency dictates that $m$ be as
close as possible to $f$ in order to avoid wasting simulations. A remark of
importance is that, because of the constraint $m(x) \ge f(x)$, $m$ cannot be a
probability density. We then write

    $m(x) = M g(x)$    where    $\int_{\mathcal{X}} m(x)\, dx = \int_{\mathcal{X}} M g(x)\, dx = M$ ,

since $m$ is necessarily integrable (otherwise, $\mathcal{L}$ would not have
finite mass and a uniform distribution would not exist on $\mathcal{L}$). As
mentioned above, a natural way of simulating the uniform on $\mathcal{L}$ is
then to use (2.7) backwards, that is, to simulate $Y \sim g$ and then
$U | Y = y \sim \mathcal{U}(0, M g(y))$. If we only accept the $y$’s such that
the constraint $u < f(y)$ is satisfied, we have

    $P(X \in A) = P(Y \in A \mid U < f(Y)) = \frac{\int_A \int_0^{f(y)} \frac{du}{M g(y)}\, g(y)\, dy}{\int_{\mathcal{X}} \int_0^{f(y)} \frac{du}{M g(y)}\, g(y)\, dy} = \int_A f(y)\, dy$

for every measurable set $A$ and the accepted $X$’s are indeed distributed from
$f$. We have thus derived a more general implementation of the fundamental
theorem, as follows:

Fig. 2.4. Plot of a uniform sample over the set $\{(x,u) : 0 < u < f(x)\}$ for
$f(x) \propto \exp(-x^2/2)(\sin(6x)^2 + 3\cos(x)^2 \sin(4x)^2 + 1)$ and of the
envelope function $g(x) = 5\exp(-x^2/2)$.



Corollary 2.17. Let $X \sim f(x)$ and let $g(x)$ be a density function that
satisfies $f(x) \le M g(x)$ for some constant $M \ge 1$. Then, to simulate
$X \sim f$, it is sufficient to generate

    $Y \sim g$    and    $U | Y = y \sim \mathcal{U}(0, M g(y))$ ,

until $0 < u < f(y)$.

Figure 2.4 illustrates Corollary 2.17 for the target density

    $f(x) \propto \exp(-x^2/2)(\sin(6x)^2 + 3\cos(x)^2 \sin(4x)^2 + 1)$

with upper bound (or, rather, dominating density) the normal density

    $g(x) = \exp(-x^2/2)/\sqrt{2\pi}$ ,

which is obviously straightforward to generate.

Corollary 2.17 has two consequences. First, it provides a generic method to
simulate from any density $f$ that is known up to a multiplicative factor, that
is, the normalizing constant of $f$ need not be known, since the method only
requires input of the ratio $f/M$, which does not depend on the normalizing
constant. This is for instance the case of Figure 2.4, where the normalizing
constant of $f$ is unknown. This property is particularly important in Bayesian
calculations. There, a quantity of interest is the posterior distribution,
defined according to Bayes Theorem by

(2.10)    $\pi(\theta | x) \propto \pi(\theta)\, f(x | \theta)$ .

Thus, the posterior density $\pi(\theta | x)$ is easily specified up to a
normalizing constant and, to use Corollary 2.17, this constant need not be
calculated. (See Problem 2.29.)

Of course, there remains the task of finding a density $g$ satisfying
$f \le Mg$, a bound that need not be tight, in the sense that Corollary 2.17
remains valid when $M$ is replaced with any larger constant. (See Problem 2.30.)

A second consequence of Corollary 2.17 is that the probability of acceptance
is exactly $1/M$ (a geometric waiting time), when evaluated for the properly
normalized densities, and the expected number of trials until a variable is
accepted is $M$ (see Problem 2.30). Thus, a comparison between different
simulations based on different instrumental densities $g_1, g_2, \ldots$ can be
undertaken through the comparison of the respective bounds $M_1, M_2, \ldots$
(as long as the corresponding densities $g_1, g_2, \ldots$ are correctly
normalized). In particular, a first method of optimizing the choice of $g$ in
$g_1, g_2, \ldots$ is to find the smallest bound $M_i$. However, this first and
rudimentary comparison technique has some limitations, which we will see later
in this chapter.



2.3.2 The Accept-Reject Algorithm 

The implementation of Corollary 2.17 is known as the Accept-Reject method,
which is usually stated in the following slightly modified, but equivalent,
form. (See Problem 2.28 for extensions.)

Algorithm A.4 -Accept-Reject Method-

1. Generate $X \sim g$, $U \sim \mathcal{U}_{[0,1]}$ ;

2. Accept $Y = X$ if $U \le f(X)/Mg(X)$ ;                        [A.4]

3. Return to 1. otherwise.
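A generic version of [A.4] is sketched below and applied to the target of Figure 2.4 with the standard normal as instrumental density. All names are assumptions made for the illustration. Because the target is only known up to a constant, the bound is taken relative to the unnormalized target: the bracketed factor is at most 5, so $\exp(-x^2/2)(\ldots) \le 5\sqrt{2\pi}\, g(x)$.

```python
import math
import random


def target_unnormalized(x):
    """Target of Figure 2.4, known up to its normalizing constant."""
    return math.exp(-0.5 * x * x) * (
        math.sin(6 * x) ** 2 + 3 * math.cos(x) ** 2 * math.sin(4 * x) ** 2 + 1
    )


def normal_density(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def accept_reject(target, proposal_draw, proposal_density, bound, rng):
    """One draw from the target by [A.4]; returns (value, number of trials)."""
    trials = 0
    while True:
        trials += 1
        x = proposal_draw(rng)
        # Equivalent to U <= target(x) / (bound * proposal_density(x)).
        if rng.random() * bound * proposal_density(x) <= target(x):
            return x, trials


BOUND = 5.0 * math.sqrt(2.0 * math.pi)
rng = random.Random(11)
results = [
    accept_reject(target_unnormalized, lambda r: r.gauss(0.0, 1.0),
                  normal_density, BOUND, rng)
    for _ in range(10_000)
]
```

Writing the acceptance test as a product rather than a ratio avoids a division by a very small instrumental density in the tails.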



In cases where $f$ and $g$ are normalized so they are both probability
densities, the constant $M$ is necessarily larger than 1. Therefore, the size of
$M$, and thus the efficiency of [A.4], becomes a function of how closely $g$ can
imitate $f$, especially in the tails of the distribution. Note that for $f/g$ to
remain bounded, it is necessary for $g$ to have tails thicker than those of $f$.
It is therefore impossible for instance to use [A.4] to simulate a Cauchy
distribution $f$ using a normal distribution $g$; however, the reverse works
quite well. (See Problem 2.34.) Interestingly enough, the opposite case when
$g/f$ is bounded can also be processed by a tailored Markov chain Monte Carlo
algorithm derived from Doukhan et al. (1994) (see Problems 7.5 and 7.6).

A limited optimization of the Accept-Reject algorithm is possible by
choosing the instrumental density $g$ in a parametric family, and then
determining the value of the parameter which minimizes the bound $M$. A similar
comparison between two parametric families is much more delicate since it
is then necessary to take into account the computation time of one generation
from $g$ in [A.4]. In fact, pushing the reasoning to the limit, if $g = f$
and if we simulate $X \sim f$ by numerical inversion of the distribution
function, we formally achieve the minimal bound $M = 1$, but this does not
guarantee that we have an efficient algorithm, as can be seen in the case of the
normal distribution.

Example 2.18. Normals from double exponentials. Consider generating
a $\mathcal{N}(0,1)$ by [A.4] using a double-exponential distribution
$\mathcal{L}(\alpha)$, with density $g(x|\alpha) = (\alpha/2) \exp(-\alpha|x|)$.
It is then straightforward to show that

    $\frac{f(x)}{g(x|\alpha)} \le \sqrt{2/\pi}\; \alpha^{-1} e^{\alpha^2/2}$

and that the minimum of this bound (in $\alpha$) is attained for $\alpha = 1$.
The probability of acceptance is then $\sqrt{\pi/2e} = .76$, which shows that to
produce one normal random variable, this Accept-Reject algorithm requires on the
average $1/.76 \approx 1.3$ uniform variables, to be compared with the fixed
single uniform required by the Box-Muller algorithm. ||
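The acceptance probability of Example 2.18 can be checked empirically. The sketch below implements [A.4] with the double-exponential instrumental density; the function names and the way the sign of the proposal is drawn are assumptions of the illustration. Note that this simple implementation spends three uniforms per trial (magnitude, sign, acceptance test).

```python
import math
import random


def laplace_draw(alpha, rng):
    """Double-exponential draw: an Exp(alpha) magnitude with a random sign."""
    magnitude = -math.log(1.0 - rng.random()) / alpha
    return magnitude if rng.random() < 0.5 else -magnitude


def normal_via_laplace(rng, alpha=1.0):
    """N(0, 1) draw by Accept-Reject; returns (value, number of trials)."""
    bound = math.sqrt(2.0 / math.pi) * math.exp(0.5 * alpha * alpha) / alpha
    trials = 0
    while True:
        trials += 1
        x = laplace_draw(alpha, rng)
        f = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        g = 0.5 * alpha * math.exp(-alpha * abs(x))
        if rng.random() * bound * g <= f:
            return x, trials


rng = random.Random(12)
results = [normal_via_laplace(rng) for _ in range(20_000)]
acceptance_rate = len(results) / sum(t for _, t in results)
```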



A real advantage of the Accept-Reject algorithm is illustrated in the fol- 
lowing example. 

Example 2.19. Gamma Accept-Reject. We saw in Example 2.7 that if
$\alpha \in \mathbb{N}$, the Gamma distribution $\mathcal{G}a(\alpha, \beta)$
can be represented as the sum of $\alpha$ exponential random variables
$X_i \sim \mathcal{E}xp(\beta)$, which are very easy to simulate, since
$X_i = -\log(U_i)/\beta$, with $U_i \sim \mathcal{U}([0,1])$. In more general
cases (for example when $\alpha \notin \mathbb{N}$), this representation does
not hold.

A possible approach is to use the Accept-Reject algorithm with instrumental
distribution $\mathcal{G}a(a, b)$, with $a = \lfloor \alpha \rfloor$
($\alpha \ge 1$). (Without loss of generality, suppose $\beta = 1$.) The ratio
$f/g$ is $b^{-a} x^{\alpha - a} \exp\{-(1-b)x\}$, up to a normalizing
constant, yielding the bound

    $M = b^{-a} \left( \frac{\alpha - a}{(1-b)e} \right)^{\alpha - a}$

for $b < 1$. Since the maximum of $b^{a}(1-b)^{\alpha - a}$ is attained at
$b = a/\alpha$, the optimal choice of $b$ for simulating
$\mathcal{G}a(\alpha, 1)$ is $b = a/\alpha$, which gives the same mean for
$\mathcal{G}a(\alpha, 1)$ and $\mathcal{G}a(a, b)$. (See Problem 2.31.) ||
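With the optimal choice $b = a/\alpha$, the ratio $x^{\alpha-a} e^{-(1-b)x}$ is maximal at $x = \alpha$, so the acceptance test only needs this ratio relative to its maximum. The sketch below works on the log scale; the function name is an assumption, and the instrumental $\mathcal{G}a(a, b)$ draw is built as a sum of $a$ exponentials divided by the rate $b$.

```python
import math
import random


def gamma_accept_reject(alpha, rng):
    """Ga(alpha, 1) draw, alpha >= 1, from a Ga(floor(alpha), a/alpha) proposal."""
    a = math.floor(alpha)
    b = a / alpha
    d = alpha - a
    # log of max_x x**d * exp(-(1 - b) * x), attained at x = alpha
    log_max = d * math.log(alpha) - d
    while True:
        x = sum(-math.log(1.0 - rng.random()) for _ in range(a)) / b
        log_ratio = d * math.log(x) - (1.0 - b) * x
        if math.log(1.0 - rng.random()) <= log_ratio - log_max:
            return x


rng = random.Random(13)
draws = [gamma_accept_reject(2.5, rng) for _ in range(20_000)]
```

When $\alpha$ is an integer the ratio is constant and every proposal is accepted, so the function reduces to the sum of exponentials of Example 2.7.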



It may also happen that the complexity of the optimization is very expen- 
sive in terms of analysis or of computing time. In the first case, the construc- 
tion of the optimal algorithm should still be undertaken when the algorithm is 
to be subjected to intensive use. In the second case, it is most often preferable 
to explore the use of another family of instrumental distributions g. 

Example 2.20. Truncated normal distributions. Truncated normal
distributions appear in many contexts, such as in the discussion after Example
1.5. When constraints $x \ge \underline{\mu}$ produce densities proportional to

    $e^{-(x-\mu)^2/2\sigma^2}\; \mathbb{I}_{x \ge \underline{\mu}}$

for a bound $\underline{\mu}$ large compared with $\mu$, there are alternatives
which are far superior to the naive method in which a
$\mathcal{N}(\mu, \sigma^2)$ distribution is simulated until the generated value
is larger than $\underline{\mu}$. (This approach requires an average number of
$1/\Phi((\mu - \underline{\mu})/\sigma)$ simulations from
$\mathcal{N}(\mu, \sigma^2)$ for one acceptance.) Consider, without loss of
generality, the case $\mu = 0$ and $\sigma = 1$. A potential instrumental
distribution is the translated exponential distribution,
$\mathcal{E}xp(\alpha, \underline{\mu})$, with density

    $g_\alpha(z) = \alpha\, e^{-\alpha(z - \underline{\mu})}\; \mathbb{I}_{z \ge \underline{\mu}}$ .

The ratio $(f/g_\alpha)(z) \propto e^{\alpha(z - \underline{\mu})} e^{-z^2/2}$
is then bounded by $\exp(\alpha^2/2 - \alpha\underline{\mu})$ if
$\alpha > \underline{\mu}$ and by $\exp(-\underline{\mu}^2/2)$ otherwise. The
corresponding (upper) bound is

    $\frac{1}{\alpha} \exp(\alpha^2/2 - \alpha\underline{\mu})$    if $\alpha > \underline{\mu}$,
    $\frac{1}{\alpha} \exp(-\underline{\mu}^2/2)$                  otherwise.

The first expression is minimized by

(2.11)    $\alpha^* = \frac{\underline{\mu}}{2} + \frac{1}{2}\sqrt{\underline{\mu}^2 + 4}$ ,

whereas $\alpha = \underline{\mu}$ minimizes the second bound. The optimal
choice of $\alpha$ is therefore (2.11), which requires the computation of the
square root of $\underline{\mu}^2 + 4$. Robert (1995b) proposes a similar
algorithm for the case where the normal distribution is restricted to the
interval $[\underline{\mu}, \overline{\mu}]$. For some values of
$[\underline{\mu}, \overline{\mu}]$, the optimal algorithm is associated with a
value of $\alpha$, which is a solution to an implicit equation. (See also Geweke
1991 for a similar resolution of this simulation problem and Marsaglia 1964 for
an earlier solution.) ||
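For the one-sided case of Example 2.20, the acceptance ratio with $\alpha \ge \underline{\mu}$ simplifies to $\exp\{-(z-\alpha)^2/2\}$, since $-z^2/2 + \alpha z - \alpha^2/2 = -(z-\alpha)^2/2$. The sketch below uses the optimal value (2.11); the function and variable names are assumptions made for the illustration.

```python
import math
import random


def truncated_normal(mu_low, rng):
    """N(0, 1) restricted to [mu_low, infinity) via a translated exponential."""
    alpha = 0.5 * (mu_low + math.sqrt(mu_low * mu_low + 4.0))  # (2.11)
    while True:
        # Proposal: mu_low plus an Exp(alpha) variable.
        z = mu_low - math.log(1.0 - rng.random()) / alpha
        if rng.random() <= math.exp(-0.5 * (z - alpha) ** 2):
            return z


rng = random.Random(14)
draws = [truncated_normal(3.0, rng) for _ in range(20_000)]
```

For a lower bound of 3, the naive method would need several hundred normal draws per accepted value, whereas this loop accepts most proposals.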

One criticism of the Accept-Reject algorithm is that it generates “useless” 
simulations when rejecting. We will see in Chapter 3 how the method of 
importance sampling (Section 3.3) can be used to bypass this problem and 
also how both methods can be compared. 



2.4 Envelope Accept-Reject Methods 

2.4.1 The Squeeze Principle 

In numerous settings, the distribution associated with the density $f$ is difficult to simulate because of the complexity of the function $f$ itself, which may require substantial computing time at each evaluation. In the setup of Example 1.9 for instance, if a Bayesian approach is taken with $\theta$ distributed (a posteriori) as

$$(2.12) \qquad \prod_{i=1}^n \left[ 1 + \frac{(x_i - \theta)^2}{p\sigma^2} \right]^{-(p+1)/2} ,$$

where $\sigma$ is known, each single evaluation of $\pi(\theta|x)$ involves the computation of $n$ terms in the product. It turns out that an acceleration of the simulation of densities such as (2.12) can be accomplished by an algorithm that is "one step beyond" the Accept-Reject algorithm. This algorithm is an envelope algorithm and relies on the evaluation of a simpler function $g_l$ which bounds the target density $f$ from below. The algorithm is based on the following extension of Corollary 2.17 (see Problems 2.35 and 2.36).

Lemma 2.21. If there exist a density $g_m$, a function $g_l$ and a constant $M$ such that

$$g_l(x) \le f(x) \le M g_m(x) ,$$

then the algorithm

Algorithm A.5 -Envelope Accept-Reject-

1. Generate $X \sim g_m(x)$, $U \sim \mathcal{U}_{[0,1]}$;

2. Accept $X$ if $U \le g_l(X)/M g_m(X)$; [A.5]

3. otherwise, accept $X$ if $U \le f(X)/M g_m(X)$

produces random variables that are distributed according to $f$.

By the construction of a lower envelope on $f$, based on the function $g_l$, the number of evaluations of $f$ is potentially decreased by a factor

$$\frac{1}{M} \int g_l(x) \, dx ,$$

which is the probability that $f$ is not evaluated. This method is called the squeeze principle by Marsaglia (1977) and the ARS algorithm [A.7] in Section 2.4.2 is based on it. A possible way of deriving the bounds $g_l$ and $M g_m$ is to use a Taylor expansion of $f(x)$.

Example 2.22. Lower bound for normal generation. It follows from the Taylor series expansion of $\exp(-x^2/2)$ that $\exp(-x^2/2) \ge 1 - (x^2/2)$, and hence

$$\frac{1}{\sqrt{2\pi}} \left( 1 - \frac{x^2}{2} \right) \le f(x) ,$$

which can be interpreted as a lower bound for the simulation of $\mathcal{N}(0,1)$. This bound is obviously useless when $|X| > \sqrt{2}$, where it is negative, an event which occurs with probability 0.39 for $X \sim \mathcal{C}(0,1)$. ||
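To make the squeeze step of [A.5] concrete, here is a hedged sketch for the setting of Example 2.22: a $\mathcal{N}(0,1)$ target, a $\mathcal{C}(0,1)$ proposal with $M = \sqrt{2\pi/e}$ (the bound of Problem 2.34), and the Taylor lower bound. The function names and the evaluation counter are assumptions introduced for illustration only.

```python
import math
import random

M = math.sqrt(2.0 * math.pi / math.e)   # sup of f/g_m for N(0,1) against C(0,1)

def f(x):
    """Target N(0,1) density, standing in for an expensive function."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def g_m(x):
    """Proposal: standard Cauchy density."""
    return 1.0 / (math.pi * (1.0 + x * x))

def g_l(x):
    """Lower bound from exp(-x^2/2) >= 1 - x^2/2 (negative when |x| > sqrt(2))."""
    return (1.0 - 0.5 * x * x) / math.sqrt(2.0 * math.pi)

def envelope_accept_reject(n, rng=random):
    """Return n draws, the number of evaluations of f, and the number of proposals."""
    out, f_evals, proposals = [], 0, 0
    while len(out) < n:
        x = math.tan(math.pi * (rng.random() - 0.5))   # Cauchy draw by inversion
        u = rng.random()
        proposals += 1
        bound = M * g_m(x)
        if u <= g_l(x) / bound:      # squeeze step: accept without evaluating f
            out.append(x)
            continue
        f_evals += 1
        if u <= f(x) / bound:        # ordinary Accept-Reject step
            out.append(x)
    return out, f_evals, proposals
```

Comparing `f_evals` with `proposals` gives an empirical estimate of the saving factor $\frac{1}{M}\int g_l(x)\,dx$, restricted here to the region where $g_l$ is positive.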



Example 2.23. Poisson variables from logistic variables. As indicated in Example 2.9, the simulation of the Poisson distribution $\mathcal{P}(\lambda)$ using a Poisson process and exponential variables can be rather inefficient. Here, we describe a simpler alternative of Atkinson (1979), who uses the relationship between the Poisson $\mathcal{P}(\lambda)$ distribution and the logistic distribution. The logistic distribution has density and distribution function

$$f(x) = \frac{1}{\beta} \frac{\exp\{-(x-\alpha)/\beta\}}{[1 + \exp\{-(x-\alpha)/\beta\}]^2} \qquad \text{and} \qquad F(x) = \frac{1}{1 + \exp\{-(x-\alpha)/\beta\}}$$

and is therefore analytically invertible.

To better relate the continuous and discrete distributions, we consider $N = \lfloor x + 0.5 \rfloor$, the integer part of $x + 0.5$. Also, the range of the logistic distribution is $(-\infty, \infty)$, but to better match it with the Poisson, we restrict the range to $[-1/2, \infty)$. Thus, the random variable $N$ has distribution

$$P(N = n) = \frac{1}{1 + e^{-(n+0.5-\alpha)/\beta}} - \frac{1}{1 + e^{-(n-0.5-\alpha)/\beta}}$$

if $x > 1/2$ and

$$P(N = n) = \left( \frac{1}{1 + e^{-(n+0.5-\alpha)/\beta}} - \frac{1}{1 + e^{-(n-0.5-\alpha)/\beta}} \right) \frac{1 + e^{-(0.5+\alpha)/\beta}}{e^{-(0.5+\alpha)/\beta}}$$

if $-1/2 < x \le 1/2$, and the ratio of the densities is

$$(2.13) \qquad \lambda^n / P(N = n) e^{\lambda} n! .$$

Although it is difficult to compute a bound on (2.13) and, hence, to optimize it in $(\alpha, \beta)$, Atkinson (1979) proposed the choice $\alpha = \lambda$ and $\beta = \pi/\sqrt{3\lambda}$. This identifies the two first moments of $X$ with those of $\mathcal{P}(\lambda)$. For this choice of $\alpha$ and $\beta$, analytic optimization of the bound on (2.13) remains impossible, but numerical maximization and interpolation yields the bound $c = 0.767 - 3.36/\lambda$. The resulting algorithm is then



Algorithm A.6 -Atkinson's Poisson Simulation-

0. Define $\beta = \pi/\sqrt{3\lambda}$, $\alpha = \lambda\beta$ and $k = \log c - \lambda - \log\beta$;

1. Generate $U_1 \sim \mathcal{U}_{[0,1]}$ and calculate

   $x = \{\alpha - \log\{(1 - u_1)/u_1\}\}/\beta$

   until $x > -0.5$;

2. Define $N = \lfloor x + 0.5 \rfloor$ and generate $U_2 \sim \mathcal{U}_{[0,1]}$;

3. Accept $N \sim \mathcal{P}(\lambda)$ if [A.6]

   $\alpha - \beta x + \log\left( u_2 / \{1 + \exp(\alpha - \beta x)\}^2 \right) \le k + N \log\lambda - \log N!$ .

Although the resulting simulation is exact, this algorithm is based on a number of approximations, both through the choice of $(\alpha, \beta)$ and in the computation of the majorization bounds and the density ratios. Moreover, note that it requires the computation of factorials, $N!$, which may be quite time-consuming. Therefore, although [A.6] usually has a reasonable efficiency, more complex algorithms such as those of Devroye (1985) may be preferable. ||
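The steps of [A.6] translate almost line by line into code. The sketch below is an illustration under stated assumptions rather than a reference implementation: the name `atkinson_poisson` is hypothetical, $\log N!$ is computed with `math.lgamma` to avoid explicit factorials, and the interpolated bound $c = 0.767 - 3.36/\lambda$ is only positive for $\lambda$ above roughly 4.4, so the sketch is meant for moderately large $\lambda$ (a value such as 30 or more is assumed here).

```python
import math
import random

def atkinson_poisson(lam, rng=random):
    """One Poisson(lam) draw via the logistic proposal of [A.6] (lam assumed large)."""
    beta = math.pi / math.sqrt(3.0 * lam)
    alpha = lam * beta
    c = 0.767 - 3.36 / lam                  # interpolated bound quoted in the text
    k = math.log(c) - lam - math.log(beta)
    while True:
        u1 = rng.random()
        if not 0.0 < u1 < 1.0:              # guard against log(0) and division by 0
            continue
        x = (alpha - math.log((1.0 - u1) / u1)) / beta
        if x <= -0.5:                       # restrict the logistic to [-1/2, inf)
            continue
        n = math.floor(x + 0.5)
        u2 = 1.0 - rng.random()             # uniform on (0, 1]
        lhs = alpha - beta * x + math.log(u2 / (1.0 + math.exp(alpha - beta * x)) ** 2)
        # log N! is obtained as lgamma(N + 1)
        if lhs <= k + n * math.log(lam) - math.lgamma(n + 1.0):
            return n
```

A quick sanity check is to compare the empirical mean and variance of a few thousand draws with $\lambda$.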



2.4.2 Log-Concave Densities 

The particular case of log-concave densities (that is, densities whose logarithm is concave) allows the construction of a generic algorithm that can be quite efficient.

Example 2.24. Log-concave densities. Recall the exponential family (1.9)

$$f(x) = h(x) \, e^{\theta \cdot x - \psi(\theta)}, \qquad \theta, x \in \mathbb{R}^k .$$

This density is log-concave if

$$\frac{\partial^2}{\partial x^2} \log f(x) = \frac{\partial^2}{\partial x^2} \log h(x) = \frac{h(x) h''(x) - [h'(x)]^2}{h^2(x)} < 0 ,$$

which will often be the case for the exponential family. For example, if $X \sim \mathcal{N}(\theta, 1)$, then $h(x) \propto \exp\{-x^2/2\}$ and $\partial^2 \log h(x) / \partial x^2 = -1$. See Problems 2.40-2.42 for properties and examples of log-concave densities. ||



Devroye (1985) describes some algorithms that take advantage of the log- 
concavity of the density, but here we present a universal method. The al- 
gorithm, which was proposed by Gilks (1992) and Gilks and Wild (1992), 
is based on the construction of an envelope and the derivation of a corre- 
sponding Accept-Reject algorithm. The method is called adaptive rejection 
sampling (ARS) and it provides a sequential evaluation of lower and upper 
envelopes of the density / when h = log / is concave. 

Let $S_n$ be a set of points $x_i$, $i = 0, 1, \ldots, n+1$, in the support of $f$ such that $h(x_i) = \log f(x_i)$ is known up to the same constant. Given the concavity of $h$, the line $L_{i,i+1}$ through $(x_i, h(x_i))$ and $(x_{i+1}, h(x_{i+1}))$ is below the graph of $h$ in $[x_i, x_{i+1}]$ and is above this graph outside this interval (see Figure 2.5). For $x \in [x_i, x_{i+1}]$, if we define

$$\bar{h}_n(x) = \min\{L_{i-1,i}(x), L_{i+1,i+2}(x)\} \qquad \text{and} \qquad \underline{h}_n(x) = L_{i,i+1}(x) ,$$

the envelopes are

$$(2.14) \qquad \underline{h}_n(x) \le h(x) \le \bar{h}_n(x)$$

uniformly on the support of $f$. (We define

$$\underline{h}_n(x) = -\infty \qquad \text{and} \qquad \bar{h}_n(x) = \min(L_{0,1}(x), L_{n,n+1}(x))$$

on $[x_0, x_{n+1}]^c$.) Therefore, for $\underline{f}_n(x) = \exp \underline{h}_n(x)$ and $\bar{f}_n(x) = \exp \bar{h}_n(x)$, (2.14) implies that

$$\underline{f}_n(x) \le f(x) \le \bar{f}_n(x) = \varpi_n \, g_n(x) ,$$

where $\varpi_n$ is the normalizing constant of $\bar{f}_n$; that is, $g_n$ is a density. The ARS algorithm to generate an observation from $f$ is thus

Fig. 2.5. Lower and upper envelopes of $h(x) = \log f(x)$, $f$ a log-concave density (Source: Gilks et al. 1995).

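Algorithm A.7 -ARS Algorithm-

0. Initialize $n$ and $S_n$.

1. Generate $X \sim g_n(x)$, $U \sim \mathcal{U}_{[0,1]}$.

2. If $U \le \underline{f}_n(X)/\varpi_n g_n(X)$, accept $X$; [A.7]

   otherwise, if $U \le f(X)/\varpi_n g_n(X)$, accept $X$

   and update $S_n$ to $S_{n+1} = S_n \cup \{X\}$.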

An interesting feature of this algorithm is that the set $S_n$ is only updated when $f(x)$ has been previously computed. As the algorithm produces variables $X \sim f(x)$, the two envelopes $\underline{f}_n$ and $\bar{f}_n$ become increasingly accurate and, therefore, we progressively reduce the number of evaluations of $f$. Note that in the initialization of $S_n$, a necessary condition is that $\varpi_n < +\infty$ (i.e., that $g_n$ is actually a probability density). To achieve this requirement, $L_{0,1}$ needs to have positive slope if the support of $f$ is not bounded on the left and $L_{n,n+1}$ needs to have a negative slope if the support of $f$ is not bounded on the right. (See Problem 2.39 for more on simulation from $g_n$.)
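The secant construction behind (2.14) can be checked numerically. The following sketch evaluates the lower and upper envelopes of a concave $h$ at a point from a sorted set of abscissae; the helper names `line` and `envelopes` are assumptions made for illustration, and at the two end intervals only the single available neighbouring chord is used for the upper envelope.

```python
import math

def line(p, q):
    """Return the function t -> value at t of the line through points p and q."""
    (x0, y0), (x1, y1) = p, q
    slope = (y1 - y0) / (x1 - x0)
    return lambda t: y0 + slope * (t - x0)

def envelopes(x, pts, h):
    """Lower and upper envelopes of a concave h at x, from sorted abscissae pts."""
    nodes = [(t, h(t)) for t in pts]
    m = len(nodes) - 1                       # index n+1 of the last node
    if x < pts[0] or x > pts[-1]:            # outside [x_0, x_{n+1}]
        upper = min(line(nodes[0], nodes[1])(x), line(nodes[m - 1], nodes[m])(x))
        return -math.inf, upper
    i = max(j for j in range(m) if pts[j] <= x)   # x lies in [x_i, x_{i+1}]
    lower = line(nodes[i], nodes[i + 1])(x)       # chord L_{i,i+1}
    cands = []
    if i >= 1:
        cands.append(line(nodes[i - 1], nodes[i])(x))      # L_{i-1,i}
    if i + 2 <= m:
        cands.append(line(nodes[i + 1], nodes[i + 2])(x))  # L_{i+1,i+2}
    upper = min(cands) if cands else math.inf
    return lower, upper
```

For instance, with $h(x) = -x^2/2$ and `pts = [-2.0, -0.5, 1.0, 2.5]`, the returned pair brackets $h(x)$ for every $x$, and adding points tightens both envelopes.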








The ARS algorithm is not optimal in the sense that it is often possible to devise a better specialized algorithm for a given log-concave density. However, although Gilks and Wild (1992) do not provide theoretical evaluations of simulation speeds, they mention reasonable performances in the cases they consider. Note that, in contrast to the previous algorithms, the function $g_n$ is updated during the iterations and, therefore, the average computation time for one generation from $f$ decreases with $n$. This feature makes the comparison with other approaches quite delicate.

The major advantage of [A.7] compared with alternatives is its universality. For densities $f$ that are only known through their functional form, the ARS algorithm yields an automatic Accept-Reject algorithm that only requires checking $f$ for log-concavity. Moreover, the set of log-concave densities is wide; see Problems 2.40 and 2.41. The ARS algorithm thus allows for the generation of samples from distributions that are rarely simulated, without requiring the development of case-specific Accept-Reject algorithms.
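As an illustration of how such an automatic sampler can look, here is a compact sketch of adaptive rejection sampling. It deliberately swaps in the tangent-based envelope (which requires the derivative of $h = \log f$) in place of the secant-based upper envelope described above, because the tangent version is shorter to code; the lower envelope is still made of chords. The function name `ars`, its arguments, and the assumptions of a strictly concave $h$ on the whole real line with nonzero slopes at the support points are all choices made for this example.

```python
import bisect
import math
import random

def ars(h, dh, init, n_samples, rng=random):
    """Tangent-based ARS sketch: draws from the density proportional to exp(h)."""
    xs = sorted(init)
    hs = [h(t) for t in xs]
    ds = [dh(t) for t in xs]
    if not ds[0] > 0 > ds[-1]:
        raise ValueError("outer tangents must slope up on the left, down on the right")

    def tangent(i, t):                      # upper envelope piece through x_i
        return hs[i] + ds[i] * (t - xs[i])

    def etan(i, t):                         # exp(tangent), with limit 0 at +-infinity
        return 0.0 if math.isinf(t) else math.exp(tangent(i, t))

    out = []
    while len(out) < n_samples:
        # z[i], z[i+1]: breakpoints between which tangent i is the envelope
        z = [-math.inf]
        for i in range(len(xs) - 1):
            num = hs[i + 1] - hs[i] - xs[i + 1] * ds[i + 1] + xs[i] * ds[i]
            z.append(num / (ds[i] - ds[i + 1]))
        z.append(math.inf)
        mass = [(etan(i, z[i + 1]) - etan(i, z[i])) / ds[i] for i in range(len(xs))]
        i = rng.choices(range(len(xs)), weights=mass)[0]
        lo, hi = etan(i, z[i]), etan(i, z[i + 1])
        arg = lo + rng.random() * (hi - lo)
        if arg <= 0.0:
            continue                        # measure-zero edge case
        x = xs[i] + (math.log(arg) - hs[i]) / ds[i]   # inverse cdf within the piece
        log_u = math.log(1.0 - rng.random())
        upper = tangent(i, x)
        j = bisect.bisect_right(xs, x) - 1
        if 0 <= j < len(xs) - 1:            # chord between neighbours bounds h from below
            lower = hs[j] + (hs[j + 1] - hs[j]) * (x - xs[j]) / (xs[j + 1] - xs[j])
            if log_u <= lower - upper:
                out.append(x)               # squeeze acceptance: h is not evaluated
                continue
        hx, dx = h(x), dh(x)
        if log_u <= hx - upper:
            out.append(x)
        if x not in xs and dx != 0.0:       # h(x) is known now: refine the envelope
            k = bisect.bisect_left(xs, x)
            xs.insert(k, x); hs.insert(k, hx); ds.insert(k, dx)
    return out
```

For example, `ars(lambda x: -0.5 * x * x, lambda x: -x, [-1.0, 1.0], 5000)` returns draws that should be close to $\mathcal{N}(0,1)$; the support set grows only when $h$ has actually been evaluated, as in [A.7].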

Example 2.25. Capture-recapture models. In a heterogeneous capture-recapture model (see Seber 1983, 1992 or Borchers et al. 2002), animals are captured at time $i$ with probability $p_i$, the size $N$ of the population being unknown. The corresponding likelihood is therefore

$$L(p_1, \ldots, p_I | N, n_1, \ldots, n_I) = \frac{N!}{(N-r)!} \prod_{i=1}^{I} p_i^{n_i} (1 - p_i)^{N - n_i} ,$$

where $I$ is the number of captures, $n_i$ is the number of captured animals during the $i$th capture, and $r$ is the total number of different captured animals. If $N$ is a priori distributed as a $\mathcal{P}(\lambda)$ variable and the $p_i$'s are from a normal logistic model,

$$\alpha_i = \log\left( \frac{p_i}{1 - p_i} \right) \sim \mathcal{N}(\mu_i, \sigma^2) ,$$

as in George and Robert (1992), the posterior distribution satisfies

$$\pi(\alpha_i | N, n_1, \ldots, n_I) \propto \exp\left\{ \alpha_i n_i - \frac{1}{2\sigma^2} (\alpha_i - \mu_i)^2 \right\} \Big/ (1 + e^{\alpha_i})^N .$$

If this conditional distribution must be simulated (for reasons which will be made clearer in Chapters 9 and 10), the ARS algorithm can be implemented. In fact, the log of the posterior distribution

$$(2.15) \qquad \alpha_i n_i - \frac{1}{2\sigma^2} (\alpha_i - \mu_i)^2 - N \log(1 + e^{\alpha_i})$$

is concave in $\alpha_i$, as can be shown by computing the second derivative (see also Problem 2.42).

As an illustration, consider the dataset $(n_1, \ldots, n_{11}) = (32, 20, 8, 5, 1, 2, 0, 2, 1, 1, 0)$ which describes the number of recoveries over the years 1957-1968 of $N = 1612$ Northern Pintail ducks banded in 1956, as reported in Johnson and Hoeting (2003). Figure 2.6 provides the corresponding posterior distributions for the first 9 $\alpha_i$'s. The ARS algorithm can then be used independently for each of these distributions. For instance, if we take the year 1960, the starting points in $S$ can be $-10$, $-6$ and $-3$. The set $S$ then gets updated along iterations as in Algorithm [A.7], which provides a correct simulation from the posterior distributions of the $\alpha_i$'s, as illustrated in Figure 2.7 for the year 1960. ||

Fig. 2.6. Posterior distributions of the capture log-odds ratios for the Northern Pintail duck dataset of Johnson and Hoeting (2003) for the years 1957-1965.

The above example also illustrates that checking for log-concavity of a Bayesian posterior distribution is straightforward, as $\log \pi(\theta|x) = \log \pi(\theta) + \log f(x|\theta) + c$, where $c$ is a constant (in $\theta$). This implies that the log-concavity of $\pi(\theta)$ and of $f(x|\theta)$ (in $\theta$) are sufficient to conclude the log-concavity of $\pi(\theta|x)$.

Example 2.26. Poisson regression. Consider a sample $(Y_1, x_1), \ldots, (Y_n, x_n)$ of integer-valued data $Y_i$ with explanatory variable $x_i$, where $Y_i$ and $x_i$ are connected via a Poisson distribution,

$$Y_i | x_i \sim \mathcal{P}(\exp\{a + b x_i\}) .$$

If the prior distribution of $(a, b)$ is a normal distribution $\mathcal{N}(0, \sigma^2) \times \mathcal{N}(0, \tau^2)$, the posterior distribution of $(a, b)$ is given by

$$\pi(a, b | x, y) \propto \exp\left\{ a \sum_i y_i + b \sum_i y_i x_i - e^a \sum_i e^{x_i b} \right\} e^{-a^2/2\sigma^2} e^{-b^2/2\tau^2} .$$

We will see in Chapter 9 that it is often of interest to simulate successively the (full) conditional distributions $\pi(a | x, y, b)$ and $\pi(b | x, y, a)$. Since

$$\log \pi(a | x, y, b) = a \sum_i y_i - e^a \sum_i e^{x_i b} - a^2/2\sigma^2 ,$$

$$\log \pi(b | x, y, a) = b \sum_i y_i x_i - e^a \sum_i e^{x_i b} - b^2/2\tau^2 ,$$

and

$$\frac{\partial^2}{\partial a^2} \log \pi(a | x, y, b) = - \sum_i e^{x_i b} e^a - \sigma^{-2} < 0 ,$$

$$\frac{\partial^2}{\partial b^2} \log \pi(b | x, y, a) = - e^a \sum_i x_i^2 e^{x_i b} - \tau^{-2} < 0 ,$$

the ARS algorithm directly applies for both conditional distributions.

Fig. 2.7. Histogram of an ARS sample of 5000 points and corresponding posterior distribution of the log-odds ratio $\alpha_{1960}$.
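The concavity of the intercept conditional can also be verified numerically. The sketch below uses the horse-kick counts of Table 2.1 and assumes, for illustration only, that the explanatory variable is the time index $x_i = \text{year} - 75$ (the text does not specify the coding); the function names are likewise hypothetical.

```python
import math

# Deaths by year (1875-1894) from Table 2.1.
years = list(range(75, 95))
deaths = [3, 5, 7, 9, 10, 18, 6, 14, 11, 9, 5, 11, 15, 6, 11, 17, 12, 15, 8, 4]
x = [t - 75 for t in years]   # assumed coding of the explanatory variable

def log_cond_a(a, b, sigma2, x=x, y=deaths):
    """log pi(a | x, y, b) up to an additive constant."""
    s = sum(math.exp(b * xi) for xi in x)
    return a * sum(y) - math.exp(a) * s - a * a / (2.0 * sigma2)

def d2_log_cond_a(a, b, sigma2, x=x):
    """Second derivative in a: -e^a sum_i e^{b x_i} - 1/sigma^2, always negative."""
    s = sum(math.exp(b * xi) for xi in x)
    return -math.exp(a) * s - 1.0 / sigma2
```

Since the second derivative is negative for every $a$, any three starting points with a positive slope on the left and a negative slope on the right give a valid initial set $S_n$.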

As an illustration, consider the data in Table 2.1. This rather famous data set gives the deaths in the Prussian Army due to kicks from horses, gathered by von Bortkiewicz (1898). A question of interest is whether there is a trend in the deaths over time. For illustration here, we show how to generate the conditional distribution of the intercept, $\pi(a | x, y, b)$, since the generation of the other conditional is quite similar.

Before implementing the ARS algorithm, we note two simplifications. First, if $f(x)$ is easy to compute (as in this example), there is really no need to construct $\underline{f}_n(x)$, and we just skip that step in Algorithm [A.7]. Second, we do not need to construct the function $g_n$; we only need to know how to simulate from it. To do this, we only need to compute the area of each segment above the intervals $[x_i, x_{i+1}]$.

Table 2.1. Data from the 19th century study by Bortkiewicz (1898) of deaths in the Prussian army due to horse kicks. The data are the number of deaths in fourteen army corps from 1875 to 1894.

Year     75  76  77  78  79  80  81  82  83  84
Deaths    3   5   7   9  10  18   6  14  11   9

Year     85  86  87  88  89  90  91  92  93  94
Deaths    5  11  15   6  11  17  12  15   8   4




Fig. 2.8. Left panel is the area of integration for the weight of the interval $[x_2, x_3]$. The right panel is the histogram and density of the sample from $g_n$, with $b = .025$ and $\sigma^2 = 5$.

The left panel of Figure 2.8 shows the region of the support of $f(x)$ between $x_2$ and $x_3$, with the grey shaded area proportional to the probability of selecting the region $[x_2, x_3]$. If we denote by $a_i + b_i x$ the line through $(x_i, h(x_i))$ and $(x_{i+1}, h(x_{i+1}))$, then the area of the grey region of Figure 2.8 is

$$(2.16) \qquad \omega_i = \int_{x_i}^{x_{i+1}} e^{a_i + b_i x} \, dx = \frac{e^{a_i}}{b_i} \left[ e^{b_i x_{i+1}} - e^{b_i x_i} \right] .$$

Thus, to sample from $g_n$ we choose a region $[x_i, x_{i+1}]$ proportional to $\omega_i$, generate $U \sim \mathcal{U}_{[0,1]}$ and then take

$$X = x_i + U (x_{i+1} - x_i) .$$

The right panel of Figure 2.8 shows the good agreement between the histogram and density of $g_n$. (See Problems 2.37 and 2.38 for generating the $g_n$ corresponding to $\pi(b | x, y, a)$, and Problems 9.7 and 9.8 for full Gibbs samplers.) ||
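The segment weight (2.16) is simple to compute, and a draw within a segment can be made exact by inverting the cdf of the exponential of a line. The sketch below shows both; note that the inverse-cdf draw is an addition made here for illustration, not the uniform draw $X = x_i + U(x_{i+1} - x_i)$ stated above, and that the function names are hypothetical.

```python
import math

def segment_area(a_i, b_i, x_lo, x_hi):
    """Area under exp(a_i + b_i x) on [x_lo, x_hi], as in (2.16)."""
    if b_i == 0.0:                      # flat segment: the formula's limit as b_i -> 0
        return math.exp(a_i) * (x_hi - x_lo)
    return math.exp(a_i) / b_i * (math.exp(b_i * x_hi) - math.exp(b_i * x_lo))

def draw_in_segment(b_i, x_lo, x_hi, v):
    """Map v in [0, 1] to a point of the segment with density prop. to exp(b_i x)."""
    if b_i == 0.0:
        return x_lo + v * (x_hi - x_lo)
    lo, hi = math.exp(b_i * x_lo), math.exp(b_i * x_hi)
    return math.log(lo + v * (hi - lo)) / b_i
```

Selecting a segment with probability proportional to `segment_area` and then calling `draw_in_segment` with a fresh uniform gives a draw from the piecewise exponential density; for short segments the uniform draw of the text is a close approximation.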



2.5 Problems 



2.1 Check the uniform random number generator on your computer:

(a) Generate 1,000 uniform random variables and make a histogram.

(b) Generate uniform random variables $(X_1, \ldots, X_n)$ and plot the pairs $(X_i, X_{i+1})$ to check for autocorrelation.

2.2 (a) Generate a binomial $\mathcal{B}in(n, p)$ random variable with $n = 25$ and $p = .2$. Make a histogram and compare it to the binomial mass function, and to the R binomial generator.

(b) Generate 5,000 logarithmic series random variables with mass function

$$P(X = x) = \frac{-(1-p)^x}{x \log p} , \qquad x = 1, 2, \ldots \qquad 0 < p < 1 .$$

Make a histogram and plot the mass function.

2.3 In each case generate the random variables and compare to the density function.

(a) Normal random variables using a Cauchy candidate in Accept-Reject;

(b) Gamma $\mathcal{G}a(4.3, 6.2)$ random variables using a Gamma $\mathcal{G}a(4, 7)$;

(c) Truncated normal: Standard normal truncated to $(2, \infty)$.

2.4 The arcsine distribution was discussed in Example 2.2.

(a) Show that the arcsine distribution, with density $f(x) = 1/\pi\sqrt{x(1-x)}$, is invariant under the transform $y = 1 - x$, that is, $f(x) = f(y)$.

(b) Show that the uniform distribution $\mathcal{U}_{[0,1]}$ is invariant under the "tent" transform,

$$D(x) = \begin{cases} 2x & \text{if } x \le 1/2 , \\ 2(1 - x) & \text{if } x > 1/2 . \end{cases}$$

(c) As in Example 2.2, use both the arcsine and "tent" distributions in the dynamic system $X_{n+1} = D(X_n)$ to generate 100 uniform random variables. Check the properties with marginal histograms, and plots of the successive iterates.

(d) The tent distribution can have disastrous behavior. Given the finite representation of real numbers in the computer, show that the sequence $(X_n)$ will converge to a fixed value, as the tent function progressively eliminates the last decimals of $X_n$. (For example, examine what happens when the sequence starts at a value of the form $1/2^n$.)

2.5 For each of the following distributions, calculate the explicit form of the distribution function and show how to implement its generation starting from a uniform random variable: (a) exponential; (b) double exponential; (c) Weibull; (d) Pareto; (e) Cauchy; (f) extreme value; (g) arcsine.

2.6 Referring to Example 2.7:

(a) Show that if $U \sim \mathcal{U}_{[0,1]}$, then $X = -\log U/\lambda \sim \mathcal{E}xp(\lambda)$.

(b) Verify the distributions in (2.2).

(c) Show how to generate an $\mathcal{F}_{m,n}$ random variable, where both $m$ and $n$ are even integers.

(d) Show that if $U \sim \mathcal{U}_{[0,1]}$, then $X = \log \frac{U}{1-U}$ is a Logistic(0, 1) random variable. Show also how to generate a Logistic($\mu$, $\beta$) random variable.

2.7 Establish the properties of the Box-Muller algorithm of Example 2.8. If $U_1$ and $U_2$ are iid $\mathcal{U}_{[0,1]}$, show that:

(a) The transforms

$$X_1 = \sqrt{-2\log(U_1)} \cos(2\pi U_2) , \qquad X_2 = \sqrt{-2\log(U_1)} \sin(2\pi U_2) ,$$

are iid $\mathcal{N}(0, 1)$.

(b) The polar coordinates are distributed as

$$r^2 = X_1^2 + X_2^2 \sim \chi_2^2 , \qquad \theta = \arctan\frac{X_1}{X_2} \sim \mathcal{U}[0, 2\pi] .$$

(c) Establish that $\exp(-r^2/2) \sim \mathcal{U}[0, 1]$, and so $r^2$ and $\theta$ can be simulated directly.

2.8 (Continuation of Problem 2.7)

(a) Show that an alternate version of the Box-Muller algorithm is

Algorithm A.8 -Box-Muller (2)-

1. Generate $U_1, U_2 \sim \mathcal{U}([-1, 1])$ until $S = U_1^2 + U_2^2 \le 1$. [A.8]

2. Define $Z = \sqrt{-2\log(S)/S}$ and take

   $X_1 = Z U_1 , \quad X_2 = Z U_2 .$

(Hint: Show that $(U_1, U_2)$ is uniform on the unit sphere and that $X_1$ and $X_2$ are independent.)

(b) Give the average number of generations in 1. and compare with the original Box-Muller algorithm [A.3] on a small experiment.

(c) Examine the effect of not constraining $(U_1, U_2)$ to the unit circle.
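For the small experiment of part (b), a minimal sketch of the polar version [A.8] may be useful. The function name `polar_box_muller` and the returned trial counter are assumptions made here; the guard `s > 0` only avoids the measure-zero case $S = 0$.

```python
import math
import random

def polar_box_muller(rng=random):
    """Return a pair of N(0,1) draws and the number of (U1, U2) pairs generated."""
    trials = 0
    while True:
        u1 = rng.uniform(-1.0, 1.0)
        u2 = rng.uniform(-1.0, 1.0)
        trials += 1
        s = u1 * u1 + u2 * u2
        if 0.0 < s <= 1.0:              # keep only points inside the unit disc
            break
    z = math.sqrt(-2.0 * math.log(s) / s)
    return z * u1, z * u2, trials
```

Averaging `trials` over many calls gives an empirical answer to part (b), to be compared with the ratio of the area of the square $[-1,1]^2$ to that of the unit disc.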

2.9 Show that the following version of the Box-Muller algorithm produces one normal variable and compare the execution time with both versions [A.3] and [A.8]:

Algorithm A.9 -Box-Muller (3)-

1. Generate $Y_1, Y_2 \sim \mathcal{E}xp(1)$ until $Y_2 > (1 - Y_1)^2/2$.

2. Generate $U \sim \mathcal{U}([0, 1])$ and take [A.9]

   $X = Y_1$ if $U < 0.5$, and $X = -Y_1$ if $U > 0.5$.


2.10 Examine the properties of an algorithm to simulate $\mathcal{N}(0, 1)$ random variables based on the Central Limit Theorem, which takes the appropriately adjusted mean of a sequence of uniforms $U_1, \ldots, U_n$ for $n = 12$, $n = 48$ and $n = 96$. Consider, in particular, the moments, ranges, and tail probability calculations based on the generated variables.

2.11 For the generation of a Cauchy random variable, compare the inversion method with one based on the generation of the normal pair of the polar Box-Muller method (Problem 2.8).

(a) Show that, if $X_1$ and $X_2$ are iid normal, $Y = X_1/X_2$ is distributed as a Cauchy random variable.

(b) Show that the Cauchy distribution function is $F(x) = 1/2 + \tan^{-1}(x)/\pi$, so the inversion method is easily implemented.

(c) Is one of the two algorithms superior?

2.12 Use the algorithm of Example 2.10 to generate the following random variables. 
In each case make a histogram and compare it to the mass function, and to the 
generator in your computer. 

(a) Binomials and Poisson distributions; 

(b) The hypergeometric distribution; 

(c) The logarithmic series distribution. A random variable $X$ has a logarithmic series distribution with parameter $p$ if

$$P(X = x) = \frac{-(1-p)^x}{x \log p} , \qquad x = 1, 2, \ldots, \qquad 0 < p < 1 .$$

(d) Referring to part (a), for different parameter values, compare the algorithms 
there with those of Problems 2.13 and 2.16. 

2.13 Referring to Example 2.9.

(a) Show that if $N \sim \mathcal{P}(\lambda)$ and $X_i \sim \mathcal{E}xp(\lambda)$, $i \in \mathbb{N}^*$, independent, then

$$P_\lambda(N = k) = P_\lambda(X_1 + \cdots + X_k \le 1 < X_1 + \cdots + X_{k+1}) .$$

(b) Use the results of part (a) to justify that the following algorithm simulates a Poisson $\mathcal{P}(\lambda)$ random variable:

Algorithm A.10 -Poisson simulation-

$p = 1$, $N = 0$, $c = e^{-\lambda}$.

1. Repeat [A.10]
      $N = N + 1$
      generate $U_i$
      update $p = p U_i$
   until $p < c$.

2. Take $X = N - 1$.

(Hint: For part (a), integrate the Gamma density by parts.)

2.14 There are (at least) two ways to establish (2.4) of Example 2.12: 

(a) Make the transformation x = w = y, and integrate out w. 

(b) Make the transformation x = yz, w = z, and integrate out w. 

2.15 In connection with Example 2.16, for a Beta distribution Be{a,(3), find the 
maximum of the Be{a,j3) density. 

2.16 Establish the validity of Knuth (1981) B{n,p) generator: 

Algorithm A. 11 -Binomial— 

Define fc = n, 0 = p and x = 0. 

1. Repeat f = [1 + ^0] 

V ^ [A.U] 

e = 0/V and k ^ i - 1; 

otherwise, x = x i, 0 — (B — K)/{1 — V) and k ^ k — i 
until k < 

2. For i = 1, 2, . . . , fc, 

generate Ui 

If Ui < X — H“ 1. 

3. Take x. 

2.17 Establish the claims of Example 2.11: If t/i, . . . , f/n is an iid sample from ^o,i] 
and f/(i) < • • • < U(n) are the corresponding order statistics, show that 

(a) ~ Be(i, n — i + 1); 

(b) {U^h),U(i2) - ~ 

ik + 1 ); 

(c) If U and V are iid U[o^i], the distribution of 

Ul/oc 

jji/oc yi//3’ 

conditional on < 1, is the Be{a, /3) distribution. 

(d) Show that the order statistics can be directly generated via the Renyi rep- 
resentation U(^i) = where the i/^’s are iid Sxp{l). 

2.18 For the generation of a Student's $t$ distribution, $\mathcal{T}(\nu, 0, 1)$, Kinderman et al. (1977) provide an alternative to the generation of a normal random variable and a chi squared random variable.

Algorithm A.12 -Student's t-

1. Generate $U_1, U_2 \sim \mathcal{U}([0, 1])$.

2. If $U_1 < 0.5$, $X = 1/(4U_1 - 1)$ and $V = X^{-2} U_2$;
   otherwise, $X = 4U_1 - 3$ and $V = U_2$.

3. If $V < 1 - (|X|/2)$ or $V < (1 + (X^2/\nu))^{-(\nu+1)/2}$, take $X$;
   otherwise, repeat.

Validate this algorithm and compare it with the algorithm of Example 2.13.

2.19 For $\alpha \in [0, 1]$, show that the algorithm

Algorithm A.13
Generate

   $U \sim \mathcal{U}([0, 1])$ [A.13]

until $U < \alpha$.

produces a simulation from $\mathcal{U}([0, \alpha])$. Compare it with the transform $\alpha U$, $U \sim \mathcal{U}(0, 1)$ for values of $\alpha$ close to 0 and close to 1.

2.20 In each case generate the random variables and compare the histogram to the 
density function 

(a) Normal random variables using a Cauchy candidate in Accept-Reject 

(b) Gamma(4.3, 6.2) random variables using a Gamma(4, 7). 

(c) Truncated normal: Standard normal truncated to (2, oo) 

2.21 An efficient algorithm for the simulation of Gamma ^a(Q;, 1) distributions is 
based on Burr’s distribution, a distribution with density 

It has been developed by Cheng (1977) and Cheng and Feast (1979). (See De- 
vroye 1985.) For a > 1, it is 

Algorithm A. 14 -Cheng and Feast’s Gamma- 

Define ci=o — 1 , C2 = (at- (l/6a))/ci , C3 = 2 /ci , C4 = l + Cs, 
and cs — l/^/a. 

1 . Repeat 
generate Ui,U 2 

take Ui = U 2 + 05(1 — I. 86 E/ 1 ) if a > 2.5 
until 0 < Ui < 1 . 

2 . W = C2U2/Uu 

3 . If caf/i + W~^ < C 4 or C 3 log U\ — log W -\-W < 1 , 
take CiW ; 

otherwise , repeat . 



(a) Show p is a density. 

(b) Show that this algorithm produces variables generated from Qa{a, 1). 

2.22 Ahrens and Dieter (1974) propose the following algorithm to generate a Gamma $\mathcal{G}a(\alpha, 1)$ distribution:

Algorithm A.15 -Ahrens and Dieter's Gamma-

1. Generate $U_0, U_1$.

2. If $U_0 > e/(e + \alpha)$, $x = -\log\{(\alpha + e)(1 - U_0)/\alpha e\}$ and $y = x^{\alpha - 1}$;
   otherwise, $x = \{(\alpha + e) U_0 / e\}^{1/\alpha}$ and $y = e^{-x}$. [A.15]

3. If $U_1 < y$, take $x$;
   otherwise, repeat.

Show that this algorithm produces variables generated from $\mathcal{G}a(\alpha, 1)$. Compare with Problem 2.21.

2.23 To generate the Beta distribution $\mathcal{B}e(\alpha, \beta)$ we can use the following representation:

(a) Show that, if $Y_1 \sim \mathcal{G}a(\alpha, 1)$, $Y_2 \sim \mathcal{G}a(\beta, 1)$, then

$$X = \frac{Y_1}{Y_1 + Y_2} \sim \mathcal{B}e(\alpha, \beta) .$$

(b) Use part (a) to construct an algorithm to generate a Beta random variable.

(c) Compare this algorithm with the method given in Problem 2.17 for different values of $(\alpha, \beta)$.

(d) Compare this algorithm with an Accept-Reject algorithm based on (i) the 
uniform distribution; (ii) the truncated normal distribution (when a > 1 
and !3>1). 

{Note: See Schmeiser and Shalaby 1980 for an alternative Accept-Reject algo- 
rithm to generate Beta rv’s.) 

2.24 (a) Show that Student’s t density can be written in the form (2.6), where 

h\{x\y) is the density of A7(0, u/y) and h 2 {y) is the density of x^- 
(b) Show that Fisher’s Tm,u density can be written in the form (2.6), with 
h\{x\y) the density of Qa{ml2^v Im) and h 2 {y) the density of x^- 

2.25 The noncentral chi squared distribution, Xp(^)? can be defined by a mixture 
representation (2.6), where h\{x\K) is the density of Xp+ 2 K and h 2 {k) is the 
density of 'P(A/2). 

(a) Show that it can also be expressed as the sum of a Xp-i random variable 
and of the square of a standard normal variable. 

(b) Compare the two algorithms which can be derived from these representa- 
tions. 

(c) Discuss whether a direct approach via an Accept-Reject algorithm is at all 
feasible. 

2.26 (Walker 1997) Show that the Weibull distribution, We(a,^), with density 

f{x\a, /3) = I3ax°'~^ exp {-Px^") , 

can be represented as a mixture of A ~ Be{a,u;^^^) by a; ~ ^a(2,/?). Examine 
whether this representation is helpful from a simulation point of view. 

2.27 An application of the mixture representation can be used to establish the 
following result (see Note 2.6.3): 

Lemma 2.27. If 

/(^) ^ ^ 

where fi and f 2 are probability densities such that fi{x) > ef 2 {x), the algorithm 
Generate 

until (7>e/2(A'}//i(X). 

produces a variable X distributed according to /. 

(a) Show that the distribution of X satisfies 

(b) Evaluate the integral in (a) to complete the proof. 

2.28 (a) Demonstrate the equivalence of Corollary 2.17 and the Accept-Reject algorithm.

(b) Generalize Corollary 2.17 to the multivariate case. That is, for $X = (X_1, X_2, \ldots, X_p) \sim f(x_1, x_2, \ldots, x_p) = f(x)$, formulate a joint distribution $(X, U) \sim \mathbb{I}(0 < u < f(x))$, and show how to generate a sample from $f(x)$ based on uniform random variables.

2.29 (a) Referring to (2.10), if 7t(^|x) is the target density in an Accept-Reject 
algorithm, and 7t(6) is the candidate density, show that the bound M can 
be taken to be the likelihood function evaluated at the MLE. 

(b) For estimating a normal mean, a robust prior is the Cauchy. For X ~ 
N{9, 1), 6 ^ Cauchy(0, 1), the posterior distribution is 



7t{0\x) oc 



1 1 
7t(1 -|- ^2) 27T^ 



(x-ef/2 



Use the Accept-Reject algorithm, with a Cauchy candidate, to generate a 
sample from the posterior distribution. 

{Note: See Problem 3.19 and Smith and Gelfand 1992.) 

2.30 For the Accept-Reject algorithm [A.4], with $f$ and $g$ properly normalized,

(a) Show that the probability of accepting a random variable is

$$P\left( U < \frac{f(X)}{M g(X)} \right) = \frac{1}{M} .$$

(b) Show that $M \ge 1$.

(c) Let N be the number of trials until the kth random variable is accepted. 
Show that, for the normalized densities, N has the negative binomial dis- 
tribution Afeg{k,p), where p = 1/M. Deduce that the expected number of 
trials until k random variables are obtained is kM. 

(d) Show that the bound M does not have to be tight; that is, there may be 
M' < M such that f{x) < M'g{x). Give an example where it makes sense 
to use M instead of M' . 

(e) When the bound M is too tight (i.e., when f{x) > Mg{x) on a non- 
negligible part of the support of /), show that the algorithm [A. 4] does 
not produce a generation from /. Give the resulting distribution. 

(f) When the bound is not tight, show that there is a way, using Lemma 2.27, 
to recycle part of the rejected random variables. {Note: See Casella and 
Robert 1998 for details.) 

2.31 For the Accept-Reject algorithm of the $\mathcal{G}a(n, 1)$ distribution, based on the $\mathcal{E}xp(\lambda)$ distribution, determine the optimal value of $\lambda$.

2.32 This problem looks at a generalization of Example 2.19.

(a) If the target distribution of an Accept-Reject algorithm is the Gamma distribution $\mathcal{G}a(\alpha, \beta)$, where $\alpha > 1$ is not necessarily an integer, show that the instrumental distribution $\mathcal{G}a(a, b)$ is associated with the ratio

$$\frac{f(x)}{g(x)} = \frac{\Gamma(a)}{\Gamma(\alpha)} \frac{\beta^\alpha}{b^a} \, x^{\alpha - a} e^{-(\beta - b)x} .$$

(b) Why do we need $a \le \alpha$ and $b < \beta$?

(c) For $a = \lfloor \alpha \rfloor$, show that the bound is maximized (in $x$) at $x = (\alpha - a)/(\beta - b)$.

(d) For $a = \lfloor \alpha \rfloor$, find the optimal choice of $b$.

(e) Compare with $a' = \lfloor \alpha \rfloor - 1$, when $\alpha > 2$.

2.33 The right-truncated Gamma distribution $\mathcal{TG}(a, b, t)$ is defined as the restriction of the Gamma distribution $\mathcal{G}a(a, b)$ to the interval $(0, t)$.

(a) Show that we can consider t = 1 without loss of generality. 

(b) Give the density / of TQ{a^b^ 1) and show that it can be expressed as the 
following mixture of Beta Be(a, k 1) densities: 



/w = 



6“e- 



k\ 



b l,k 






where 7 (a, 6) = 

(c) If / is replaced with gn which is the series truncated at term k = show 
that the acceptance probability of the Accept-Reject algorithm based on 
(9nJ) is 

_ 7(^ + 1,^) 

n! 

1 _ 7(q + ^ + l,6)r(g) ‘ 
r{a + n -f 1)7(0, b) 

(d) Evaluate this probability for different values of (a, 6). 

(e) Give an Accept-Reject algorithm based on the pair (pn,/) and a com- 
putable bound. {Note: See Philippe 1997c for a complete resolution of the 
problem.) 

2.34 Let $f(x) = \exp(-x^2/2)$ and $g(x) = 1/(1 + x^2)$, densities of the normal and Cauchy distributions, respectively (ignoring the normalization constants).

(a) Show that the ratio

$$\frac{f(x)}{g(x)} = (1 + x^2) e^{-x^2/2} \le 2/\sqrt{e} ,$$

which is attained at $x = \pm 1$.

(b) Show that for the normalized densities, the probability of acceptance is $\sqrt{e/2\pi} = 0.66$, which implies that, on the average, one out of every three simulated Cauchy variables is rejected. Show that the mean number of trials to success is $1/.66 = 1.52$.

(c) Replacing $g$ by a Cauchy density with scale parameter $\sigma$,

$$g_\sigma(x) = 1/\{\pi\sigma(1 + x^2/\sigma^2)\} ,$$

show that the bound on $f/g_\sigma$ is $2\sigma^{-1} \exp\{\sigma^2/2 - 1\}$ and is minimized by $\sigma^2 = 1$. (This shows that $\mathcal{C}(0, 1)$ is the best choice among the Cauchy distributions for simulating a $\mathcal{N}(0, 1)$ distribution.)

2.35 There is a direct generalization of Corollary 2.17 that allows the proposal density to change at each iteration.

Algorithm A.16 -Generalized Accept-Reject-

At iteration $i$ ($i \ge 1$)

1. Generate $X_i \sim g_i$ and $U_i \sim \mathcal{U}([0, 1])$, independently.

2. If $U_i \le \epsilon_i f(X_i)/g_i(X_i)$, accept $X_i \sim f$;

3. otherwise, move to iteration $i + 1$.

(a) Let $Z$ denote the random variable that is output by this algorithm. Show that $Z$ has the cdf

$$P(Z \le z) = \sum_{i=1}^{\infty} \epsilon_i \prod_{j=1}^{i-1} (1 - \epsilon_j) \int_{-\infty}^{z} f(x) \, dx .$$

(b) Show that

$$\sum_{i=1}^{\infty} \epsilon_i \prod_{j=1}^{i-1} (1 - \epsilon_j) = 1 \qquad \text{if and only if} \qquad \sum_{i=1}^{\infty} \log(1 - \epsilon_i) \text{ diverges} .$$

Deduce that we have a valid algorithm if the second condition is satisfied.

(c) Give examples of sequences $\epsilon_i$ that satisfy, and do not satisfy, the requirement of part (b).

2.36 (a) Prove the validity of the ARS Algorithm [A. 7], without the envelope step, 

by applying Algorithm [A. 16]. 

(b) Prove the validity of the ARS Algorithm [A. 7], with the envelope step, 
directly. Note that 

P(X < xIAccept) = p(x <x\\u < or U < M 

V “ I Mgm Mgm j J 

and 

^ u ^ < xr-| ’ 

f ^ 9m ^Qm j ^Qm J f ^ Qm ^ Qm j 

which are disjoint. 

2.37 Based on the discussion in Example 2.26, write an alternate algorithm to 
Algorithm [A. 17] that does not require the calculation of the density gn- 

2.38 The histogram and density of Figure 2.8 give the candidate gn for 7r(a|x, y, 6), 
the conditional density of the intercept a in log X = a + bt, where we set b = .025 
and — 5. Produce the same picture for the slope, 6, when we set the intercept 
a = .15 and = 5. 

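2.39 Step 1 of [A.7] relies on simulations from $g_n$. Show that we can write 

$$g_n(x) = \varpi_n^{-1} \left\{ \sum_{i=0}^{r_n} e^{\alpha_i x + \beta_i}\, \mathbb{I}_{[x_i, x_{i+1}]}(x) + e^{\alpha_{-1} x + \beta_{-1}}\, \mathbb{I}_{[-\infty, x_0]}(x) + e^{\alpha_{r_n+1} x + \beta_{r_n+1}}\, \mathbb{I}_{[x_{r_n+1}, +\infty]}(x) \right\},$$ 

where $y = \alpha_i x + \beta_i$ is the equation of the segment of line corresponding to $g_n$ 
on $[x_i, x_{i+1}]$, $r_n$ denotes the number of segments, and 

$$\varpi_n = \int_{-\infty}^{x_0} e^{\alpha_{-1} x + \beta_{-1}}\,dx + \sum_{i=0}^{r_n} \int_{x_i}^{x_{i+1}} e^{\alpha_i x + \beta_i}\,dx + \int_{x_{r_n+1}}^{+\infty} e^{\alpha_{r_n+1} x + \beta_{r_n+1}}\,dx$$ 

$$\phantom{\varpi_n} = \frac{e^{\alpha_{-1} x_0 + \beta_{-1}}}{\alpha_{-1}} + \sum_{i=0}^{r_n} e^{\beta_i}\, \frac{e^{\alpha_i x_{i+1}} - e^{\alpha_i x_i}}{\alpha_i} - \frac{e^{\alpha_{r_n+1} x_{r_n+1} + \beta_{r_n+1}}}{\alpha_{r_n+1}},$$ 

when supp $f = \mathbb{R}$. 

Verify that this representation as a sequence validates the following algorithm 
for simulation from $g_n$: 

Algorithm A.17 —Supplemental ARS Algorithm— 

1. Select the interval $[x_i, x_{i+1}]$ with probability 

$$e^{\beta_i}\, \frac{e^{\alpha_i x_{i+1}} - e^{\alpha_i x_i}}{\varpi_n \alpha_i} .$$     [A.17] 

2. Generate $U \sim \mathcal{U}_{[0,1]}$ and take 

$$X = \alpha_i^{-1} \log\left[e^{\alpha_i x_i} + U \left(e^{\alpha_i x_{i+1}} - e^{\alpha_i x_i}\right)\right].$$ 

Note that the segment $\alpha_i x + \beta_i$ is not the same as the line $a_i + b_i x$ used in (2.16). 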
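The two steps of [A.17] amount to choosing a segment in proportion to its mass and then inverting the cdf of an exponential density restricted to that segment. The following Python sketch illustrates this under simplifying assumptions that are ours and not part of the problem: the support is bounded (no tail pieces), every slope is nonzero, and the function name `sample_piecewise_exponential` and the example envelope are hypothetical.

```python
import math
import random

def sample_piecewise_exponential(knots, alphas, betas, rng):
    """Draw one value from the density proportional to exp(alpha_i*x + beta_i)
    on [knots[i], knots[i+1]] (bounded support, every alpha_i nonzero)."""
    # Mass of segment i: exp(beta_i) * (exp(alpha_i*x_{i+1}) - exp(alpha_i*x_i)) / alpha_i
    masses = [
        math.exp(b) * (math.exp(a * hi) - math.exp(a * lo)) / a
        for a, b, lo, hi in zip(alphas, betas, knots[:-1], knots[1:])
    ]
    # Step 1: select a segment with probability proportional to its mass.
    i = rng.choices(range(len(masses)), weights=masses)[0]
    a, lo, hi = alphas[i], knots[i], knots[i + 1]
    # Step 2: invert the cdf of the exponential piece on the selected segment.
    u = rng.random()
    return math.log(math.exp(a * lo) + u * (math.exp(a * hi) - math.exp(a * lo))) / a

rng = random.Random(1)
# Hypothetical envelope: exp(x) on [0, 1] and exp(2 - x) on [1, 2], symmetric about 1.
draws = [sample_piecewise_exponential([0.0, 1.0, 2.0], [1.0, -1.0], [0.0, 2.0], rng)
         for _ in range(20000)]
print(sum(draws) / len(draws))
```

Since the example envelope is symmetric about 1, the printed average should be close to 1. The unbounded tails of the full representation would be handled in the same way, with the tail masses added to the list of weights.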

2.40 As mentioned in Section 2.4, many densities are log-concave. 

(a) Show that the so-called natural exponential family, 

$$dP_\theta(x) = \exp\{x \cdot \theta - \psi(\theta)\}\,d\nu(x),$$ 

is log-concave. 

(b) Show that the logistic distribution of (2.23) is log-concave. 

(c) Show that the Gumbel distribution 

$$f(x) = \frac{k^k}{(k-1)!} \exp\{-kx - ke^{-x}\}, \qquad k \in \mathbb{N}^*,$$ 

is log-concave (Gumbel 1958). 

(d) Show that the generalized inverse Gaussian distribution, 

$$f(x) \propto x^{\alpha} e^{-\beta x - \alpha/x}, \qquad x > 0,\ \alpha > 0,\ \beta > 0,$$ 

is log-concave. 

2.41 (George et al. 1993) For the natural exponential family, the conjugate prior 
measure is defined as 

$$d\pi(\theta|x_0, n_0) \propto \exp\{x_0 \cdot \theta - n_0 \psi(\theta)\}\,d\theta,$$ 

with $n_0 > 0$. (See Brown 1986, Chapter 1, for properties of exponential families.) 

(a) Show that 

$$\varphi(x_0, n_0) = \log \int_\Theta \exp\{x_0 \cdot \theta - n_0 \psi(\theta)\}\,d\theta$$ 

is convex. 

(b) Show that the so-called conjugate likelihood distribution 

$$L(x_0, n_0|\theta_1, \ldots, \theta_p) \propto \exp\left\{ x_0 \cdot \sum_{i=1}^p \theta_i - n_0 \sum_{i=1}^p \psi(\theta_i) - p\,\varphi(x_0, n_0) \right\}$$ 

is log-concave in $(x_0, n_0)$. 

(c) Deduce that the ARS algorithm applies in hierarchical Bayesian models with 
conjugate priors on the natural parameters and log-concave hyperpriors on 
$(x_0, n_0)$. 

(d) Apply the ARS algorithm to the case 

$$X_i|\theta_i \sim \mathcal{P}(\theta_i t_i), \qquad \theta_i \sim \mathcal{G}a(\alpha, \beta), \qquad i = 1, \ldots, n,$$ 

with fixed $\alpha$ and $\beta \sim \mathcal{G}a(0.1, 1)$. 

2.42 In connection with Example 2.25, 

(a) Show that a sum of log-concave functions is a log-concave function. 

(b) Deduce that (2.15) is log-concave. 

2.43 (Casella and Berger 2001, Section 8.3) This problem examines the relation- 
ship of the property of log-concavity with other desirable properties of density 
functions. 

(a) The property of monotone likelihood ratio is very important in the con- 
struction of hypothesis tests, and in many other theoretical investigations. 
A family of pdfs or pmfs $\{g(t|\theta): \theta \in \Theta\}$ for a univariate random variable 
$T$ with real-valued parameter $\theta$ has a monotone likelihood ratio (MLR) if, 
for every $\theta_2 > \theta_1$, $g(t|\theta_2)/g(t|\theta_1)$ is a monotone (nonincreasing or 
nondecreasing) function of $t$ on $\{t: g(t|\theta_1) > 0 \text{ or } g(t|\theta_2) > 0\}$. Note that $c/0$ is 
defined as $\infty$ if $0 < c$. 

Show that if a density is log-concave, it has a monotone likelihood ratio. 

(b) Let $f(x)$ be a pdf and let $a$ be a number such that, if $a \ge x \ge y$ then 
$f(a) \ge f(x) \ge f(y)$ and, if $a \le x \le y$ then $f(a) \ge f(x) \ge f(y)$. Such a pdf 
is called unimodal with a mode equal to $a$. 

Show that if a density is log-concave, it is unimodal. 

2.44 This problem will look into one of the failings of congruential generators, 
the production of parallel lines of output. Consider a congruential generator 
$D(x) = ax \bmod 1$, that is, the output is the fractional part of $ax$. 

(a) For $k = 1, 2, \ldots, 333$, plot the pairs $(k \cdot 0.003, D(k \cdot 0.003))$ for $a = 5, 20, 50$. 
What can you conclude about the parallel lines? 

(b) Show that each line has slope $a$ and the lines repeat at intervals of $1/a$ 
(hence, larger values of $a$ will increase the number of lines). (Hint: Let 
$x = i/a + \delta$, for $i = 1, \ldots, a$ and $0 < \delta < 1/a$. For this $x$, show that $D(x) = a\delta$, 
regardless of the value of $i$.) 

2.6 Notes 

2.6.1 The Kiss Generator 

Although this book is not formally concerned with the generation of uniform random 
variables (as we start from the assumption that we have an endless supply of such 
variables), it is good to understand the basic workings and algorithms that are 
used to generate these variables. In this note we describe the way in which uniform 
pseudo-random numbers are generated, and give a particularly good algorithm. 

To keep our presentation simple, rather than give a catalog of random number 
generators, we only give details for a single generator, the Kiss algorithm of Marsaglia 
and Zaman (1993). For details on other random number generators, the books of 
Knuth (1981), Rubinstein (1981), Ripley (1987), and Fishman (1996) are excellent 
sources. 

As we have remarked before, the finite representation of real numbers in a com- 
puter can radically modify the behavior of a dynamic system. Preferred generators 
are those that take into account the specifics of this representation and provide 
a uniform sequence. It is important to note that such a sequence does not really 
take values in the interval [0, 1] but rather on the integers {0, 1, ... , M}, where M 
is the largest integer accepted by the computer. One manner of characterizing the 
performance of these integer generators is through the notion of period. 

Definition 2.28. The period, $T_0$, of a generator is the smallest integer $T$ such that 
$u_{i+T} = u_i$ for every $i$; that is, such that $D^T$ is equal to the identity function. 

The period is a very important parameter, having direct impact on the usefulness 
of a random number generator. If the number of needed generations exceeds the 
period of a generator, there may be noncontrollable artifacts in the sequence (cyclic 
phenomena, false orderings, etc.). Unfortunately, a generator of the form $X_{n+1} = 
f(X_n)$ has a period no greater than $M + 1$, since $X_n$ can take at most $M + 1$ distinct 
values. In order to overcome this bound, a generator must utilize several sequences 
$X_n^i$ simultaneously (which is a characteristic of Kiss) or must involve $X_{n-1}, X_{n-2}, \ldots$ 
in addition to $X_n$, or must use other methods such as start-up tables, that is, using 
an auxiliary table of random digits to restart the generator. 
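For a small state space, the period of Definition 2.28 can be found by brute force: iterate the map until a state repeats and measure the length of the cycle. The sketch below does this in Python; the helper name `cycle_length` and the two toy generators on {0, ..., 15} are illustrative choices of ours, not generators discussed in this note.

```python
def cycle_length(step, seed):
    """Length of the cycle that the recursion x -> step(x) eventually enters
    when started from seed (brute force, suitable only for small M)."""
    first_visit = {}
    x, n = seed, 0
    while x not in first_visit:
        first_visit[x] = n
        x = step(x)
        n += 1
    return n - first_visit[x]

M = 15
# A generator on {0,...,15} that visits every state before repeating.
full = cycle_length(lambda x: (5 * x + 3) % (M + 1), seed=1)
# A poor choice of constants: the sequence 1, 6, 10, 10, ... gets stuck.
short = cycle_length(lambda x: (4 * x + 2) % (M + 1), seed=1)
print(full, short)
```

The first generator reaches the bound M + 1 = 16, while the second collapses to a fixed point, which illustrates how strongly the period depends on the constants.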

Kiss simultaneously uses two generation techniques, namely congruential gener- 
ation and shift register generation. 

Definition 2.29. A congruential generator on $\{0, 1, \ldots, M\}$ is defined by the 
function 

$$D(x) = (ax + b) \bmod (M + 1).$$ 

The period and, more generally, the performance of congruential generators 
depend heavily on the choice of $(a, b)$ (see Ripley 1987). When transforming the above 
generator into a generator on $[0, 1]$, with $D(x) = (ax + b)/(M + 1) \bmod 1$, the graph 
of $D$ should range throughout $[0, 1]^2$, and a choice of the constant $a \notin \mathbb{Q}$ would yield 
a "recovery" of $[0, 1]^2$; that is, an infinite sequence of points should fill the space. 

Although ideal, the choice of an irrational $a$ is impossible (since $a$ needs to be 
specified with a finite number of digits). With $a$ rational, a congruential generator 
will produce pairs $(x_n, D(x_n))$ that lie on parallel lines. Figure 2.9 illustrates this 
phenomenon for $a = 69069$, representing the sequence $(3k\,10^{-4}, D(3k))$ for $k = 
1, 2, \ldots, 333$. It is thus important to select $a$ in such a way as to maximize the 
number of parallel segments in $[0, 1]^2$ (see Problem 2.44). 
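The parallel-line structure can be checked numerically. The Python sketch below takes the map $D(x) = ax \bmod 1$ and the sampling grid of Problem 2.44 (step 0.003, 333 points) and records, for each pair $(x, D(x))$, the integer $i$ such that the pair lies on the line $y = ax - i$. Exact rational arithmetic is used so that no rounding interferes; the function names are ours.

```python
from fractions import Fraction
import math

def D(x, a):
    """Fractional part of a*x, that is, the map ax mod 1."""
    y = a * x
    return y - math.floor(y)

def line_indices(a, step=Fraction(3, 1000), n=333):
    """Sorted intercept indices i of the lines y = a*x - i that carry the pairs (x, D(x))."""
    xs = [k * step for k in range(1, n + 1)]
    return sorted({int(a * x - D(x, a)) for x in xs})

for a in (5, 20, 50):
    print(a, len(line_indices(a)))
```

For each of the three multipliers, the number of distinct lines equals $a$, so a larger multiplier spreads the pairs over more (and closer) parallel segments.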

Most commercial generators use congruential methods, with perhaps the most 
disastrous choice of $(a, b)$ being that of the old (and notorious) procedure RANDU 
(see Ripley 1987). Even when the choice of $(a, b)$ assures the acceptance of the 
generator by standard tests, nonuniform behavior will be observed in the last digits 
of the real numbers produced by this method, due to round-up errors. 

The second technique employed by Kiss is based on the (theoretical) independence 
between the $k$ binary components of $X_n \sim \mathcal{U}_{\{0,1,\ldots,M\}}$ (where $M = 2^k - 1$) 
and is called a shift register generator. 

Definition 2.30. For a given $k \times k$ matrix $T$, whose entries are either 0 or 1, the 
associated shift register generator is given by the transformation 

$$X_{n+1} = T X_n,$$ 

where $X_n$ is represented as a vector of binary coordinates $e_{ni}$, that is to say, 

$$X_n = \sum_{i=0}^{k-1} e_{ni} 2^i,$$ 

with $e_{ni}$ equal to 0 or 1. 

Fig. 2.9. Representation of the line $y = 69069x \bmod 1$ by uniform sampling with 
sampling step $3 \cdot 10^{-4}$. 


This second class of generators is motivated by both the internal (computer-dependent) 
representation of numbers as sequences of bits and the speed of manipulation 
of these elementary algebraic operations. Since the computation of $T x_n$ is 
done modulo 2, each addition is then the equivalent of a logical exclusive or (XOR). 
Moreover, as the matrix $T$ only contains 0 and 1 entries, multiplication by $T$ amounts 
to shifting the content of coordinates, which gives the technique its name. 

For instance, if the $i$th line of $T$ contains a 1 in the $i$th and $j$th positions uniquely, 
the $i$th coordinate of $X_{n+1}$ will be obtained by 

$$e_{(n+1)i} = (e_{ni} + e_{nj}) \bmod 2 = e_{ni} \vee e_{nj} - e_{ni} \wedge e_{nj},$$ 

where $a \wedge b = \min(a, b)$ and $a \vee b = \max(a, b)$. This is a comparison of the $i$th 
coordinate of $X_n$ and the coordinate corresponding to a shift of $(j - i)$. There also 
exist sufficient conditions on $T$ for the associated generator to have period $2^k$ (see 
Ripley 1987). 
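The equivalence between the matrix form and the bit operations can be verified directly. In the Python sketch below, `matrix_step` applies a matrix with ones on the diagonal and on one shifted diagonal, coordinate by coordinate and modulo 2, while `shift_xor` performs the same update with one shift and one exclusive or. Both function names and the word length of 8 bits are choices made for this illustration.

```python
def shift_xor(x, s, k):
    """x XOR (x shifted by s positions), truncated to k bits."""
    return (x ^ (x << s)) & ((1 << k) - 1)

def matrix_step(x, s, k):
    """Same map written coordinate by coordinate: output bit i is
    (e_i + e_{i-s}) mod 2, with e_j = 0 when j < 0."""
    bits = [(x >> i) & 1 for i in range(k)]
    out = [(bits[i] + (bits[i - s] if i >= s else 0)) % 2 for i in range(k)]
    return sum(b << i for i, b in enumerate(out))

x = 0b10110011
print(bin(shift_xor(x, 3, 8)), bin(matrix_step(x, 3, 8)))
```

The two functions agree on every 8-bit input, which is why a shift register step costs only two machine instructions per factor.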

The generators used by Kiss are based on the matrices 

$$T_R = \begin{pmatrix} 1 & 1 & 0 & \cdots & 0 \\ 0 & 1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 1 \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix} \quad\text{and}\quad T_L = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ 1 & 1 & \cdots & 0 & 0 \\ \vdots & \ddots & \ddots & & \vdots \\ 0 & \cdots & 1 & 1 & 0 \\ 0 & \cdots & 0 & 1 & 1 \end{pmatrix},$$ 

whose entries are 1 on the main diagonal and on the first upper diagonal and first 
lower diagonal, respectively, the other elements being 0. They are related to the 
right and left shift matrices, 

$$R(e_1, \ldots, e_k)^t = (0, e_1, \ldots, e_{k-1})^t,$$ 

$$L(e_1, \ldots, e_k)^t = (e_2, e_3, \ldots, e_k, 0)^t,$$ 

since $T_R = (I + R)$ and $T_L = (I + L)$. 

To generate a sequence of integers $X_1, X_2, \ldots$, the Kiss algorithm generates three 
sequences of integers. First, the algorithm uses a congruential generator to obtain 

$$I_{n+1} = (69069 \times I_n + 23606797) \pmod{2^{32}},$$ 

and then two shift register generators of the form 

$$J_{n+1} = (I + L^{15})(I + R^{17})\, J_n \pmod{2^{32}},$$ 

$$K_{n+1} = (I + L^{13})(I + R^{18})\, K_n \pmod{2^{31}}.$$ 

These are then combined to produce 

$$X_{n+1} = (I_{n+1} + J_{n+1} + K_{n+1}) \bmod 2^{32}.$$ 

Formally, this algorithm is not of the type specified in Definition 2.1, since it 
uses three parallel chains of integers. However, this feature yields advantages over 
algorithms based on a single dynamic system $X_{n+1} = f(X_n)$ since the period of Kiss 
is of order $2^{95}$, which is almost $(2^{32})^3$. In fact, the (usual) congruential generator $I_n$ 
has a maximal period of $2^{32}$, the generator of $K_n$ has a period of $2^{31} - 1$ and that of 
$J_n$ a period of $2^{32} - 2^{21} - 2^{11} + 1$ for almost all initial values $J_0$ (see Marsaglia and 
Zaman 1993 for more details). The Kiss generator has been successfully tested on 
the different criteria of Die Hard, including tests on random subsets of bits. Figure 
2.10 presents plots of $(X_n, X_{n+1})$, $(X_n, X_{n+2})$, $(X_n, X_{n+5})$ and $(X_n, X_{n+10})$ for 
$n = 1, \ldots, 5000$, where the sequence $(X_n)$ has been generated by Kiss, without 
exhibiting any nonuniform feature on the square $[0, 1]^2$. A version of this algorithm 
in the programming language C is given below. 

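Algorithm A.18 —The Kiss Algorithm— 

#include <stdint.h> 

/* One step of Kiss: updates the three states *i, *j, *k in place and 
   returns their sum modulo 2^32 as an unsigned 32-bit integer.      [A.18] */ 
uint32_t kiss(uint32_t *i, uint32_t *j, uint32_t *k) 
{ 
    *j = *j ^ (*j << 17);                  /* first shift of J            */ 
    *k = (*k ^ (*k << 18)) & 0x7FFFFFFF;   /* first shift of K, 31 bits   */ 
    *i = 69069u * (*i) + 23606797u;        /* congruential step, mod 2^32 */ 
    *j ^= (*j >> 15);                      /* second shift of J           */ 
    *k ^= (*k >> 13);                      /* second shift of K           */ 
    return *i + *j + *k; 
} 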



(See Marsaglia and Zaman 1993 for a Fortran version of Kiss.) Note that some care 
must be exercised in the use of this program as a generator on [0, 1], since it implies 
dividing by the largest integer available on the computer and, when the output is 
read as a signed integer, may result in uniform generation on [−1, 1]. 
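As an illustration of this rescaling, the Python sketch below mirrors the three recursions of the C listing and divides each output by $2^{32}$, which keeps the result in [0, 1). Python integers are unbounded, so each step is masked explicitly to emulate 32-bit unsigned arithmetic. The function names and the seed values are arbitrary choices made here, not values recommended by Marsaglia and Zaman.

```python
MASK32 = 0xFFFFFFFF

def kiss_stream(i, j, k):
    """Yield successive Kiss outputs as integers in [0, 2**32)."""
    while True:
        j = (j ^ (j << 17)) & MASK32          # first shift of J
        k = (k ^ (k << 18)) & 0x7FFFFFFF      # first shift of K, kept on 31 bits
        i = (69069 * i + 23606797) & MASK32   # congruential step
        j ^= j >> 15                          # second shift of J
        k ^= k >> 13                          # second shift of K
        yield (i + j + k) & MASK32

def kiss_uniforms(n, seeds=(12345, 34221, 12891)):
    """First n outputs rescaled to [0, 1) through division by 2**32."""
    gen = kiss_stream(*seeds)
    return [next(gen) / 2**32 for _ in range(n)]

u = kiss_uniforms(10000)
print(min(u), max(u), sum(u) / len(u))
```

Dividing the unsigned sum by $2^{32}$ rather than by the largest signed integer avoids the sign problem mentioned above.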

Fig. 2.10. Plots of pairs $(X_t, X_{t+1})$, $(X_t, X_{t+2})$, $(X_t, X_{t+5})$ and $(X_t, X_{t+10})$ for a 
sample of 5000 generations from Kiss. 

2.6.2 Quasi-Monte Carlo Methods 

Quasi-Monte Carlo methods were proposed in the 1950s to overcome some drawbacks 
of regular Monte Carlo methods by replacing probabilistic bounds on the errors 
with deterministic bounds. The idea at the core of quasi-Monte Carlo methods is to 
substitute the randomly (or pseudo-randomly) generated (uniform [0, 1]) sequences 
used in regular Monte Carlo methods with a deterministic sequence $(x_n)$ on $[0, 1]$ in 
order to minimize the so-called divergence, 

$$D(x_1, \ldots, x_n) = \sup_{0 \le u \le 1} \left| \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{[0,u]}(x_i) - u \right| .$$ 

This is also the Kolmogorov-Smirnov distance between the empirical cdf and that of 
the uniform distribution, used in nonparametric tests. For fixed $n$, the solution is 
obviously $x_i = (2i - 1)/2n$ in dimension 1, but the goal here is to get a low-discrepancy 
sequence $(x_n)$ which provides small values of $D(x_1, \ldots, x_n)$ for all $n$'s (i.e., such that 
$x_1, \ldots, x_{n-1}$ do not depend on $n$) and can thus be updated sequentially. 

As shown in Niederreiter (1992), there exist such sequences, which ensure a 
divergence rate of order $O(n^{-1} \log(n)^{d-1})$, where $d$ is the dimension of the integration 
space.⁶ Since, for any function $h$ defined on $[0, 1]$, it can be shown that the divergence 
is related to the overall approximation error by 

(2.17) $$\left| \frac{1}{n} \sum_{i=1}^n h(x_i) - \int_0^1 h(x)\,dx \right| \le V(h)\, D(x_1, \ldots, x_n)$$ 

(see Niederreiter 1992), where $V(h)$ is the total variation of $h$, 

$$V(h) = \lim_{N \to \infty} \sup_{x_0 \le x_1 \le \cdots \le x_N} \sum_{j=1}^N |h(x_j) - h(x_{j-1})|,$$ 

with $x_0 = 0$ and $x_N = 1$, the gain over standard Monte Carlo methods can be 
substantial since standard methods lead to order $O(n^{-1/2})$ errors (see Section 4.3). 

⁶ The notation $O(1/n)$ denotes a function that satisfies $0 < \lim_{n \to \infty} n\,O(1/n) < \infty$. 

The advantage over standard integration techniques such as Riemann sums is also 
important when the dimension d increases since the latter are of order (see 
Yakowitz et al. 1978). 

The true comparison with regular Monte Carlo methods is, however, more del- 
icate than a simple assessment of the order of convergence. Construction of these 
sequences, although independent from h, can be quite involved (see Fang and Wang 
1994 for examples), even though they only need to be computed once. More impor- 
tantly, the construction requires that the functions to be integrated have bounded 
support, which can be a hindrance in practice because the choice of the 
transformation to $[0, 1]^d$ is crucial for the efficiency of the method. See Niederreiter (1992) for 
extensions in optimization setups. 
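In dimension 1, the divergence and the bound (2.17) are easy to evaluate numerically. The Python sketch below computes the Kolmogorov-Smirnov form of the divergence for the midpoint design $x_i = (2i-1)/2n$ and for the base-2 van der Corput sequence, and compares the integration error for $h(x) = x^2$ (whose total variation on [0, 1] is 1) with the divergence. The van der Corput sequence is our choice of illustration; the note itself does not single out a particular low-discrepancy sequence.

```python
def discrepancy(points):
    """sup_u |(1/n) #{x_i <= u} - u| for points in [0, 1], in dimension 1."""
    xs = sorted(points)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def midpoints(n):
    """Optimal design for fixed n: x_i = (2i - 1) / (2n)."""
    return [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]

def van_der_corput(n, base=2):
    """First n terms of the radical-inverse sequence in the given base."""
    seq = []
    for k in range(1, n + 1):
        m, x, denom = k, 0.0, 1.0
        while m:
            m, digit = divmod(m, base)
            denom *= base
            x += digit / denom
        seq.append(x)
    return seq

n = 1000
for name, pts in (("midpoints", midpoints(n)), ("van der Corput", van_der_corput(n))):
    error = abs(sum(x * x for x in pts) / n - 1 / 3)   # h(x) = x^2, V(h) = 1
    print(name, discrepancy(pts), error)
```

The midpoint design attains the value $1/2n$ but must be recomputed whenever $n$ changes, whereas the van der Corput points can be extended one at a time, which is the sequential property required above.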

2.6.3 Mixture Representations 

Mixture representations, such as those used in Examples 2.13 and 2.14, can be 
extended (theoretically, at least) to many other distributions. For instance, a random 
variable $X$ (and its associated distribution) is called infinitely divisible if for every 
$n$ there exist iid random variables $X'_1, \ldots, X'_n$ such that $X \sim X'_1 + \cdots + X'_n$ (see 
Feller 1971, Section XVII.3 or Billingsley 1995, Section 28). It turns out that most 
distributions that are infinitely divisible can be represented as mixtures of Poisson 
distributions, the noncentral $\chi^2_p(\lambda)$ distribution being a particular case of this 
phenomenon. (However, this theoretical representation does not necessarily guarantee 
that infinitely divisible distributions are always easy to simulate.) 

We also note that if the finite mixture 

$$\sum_{i=1}^k p_i f_i(x)$$ 

can result in a decomposition of $f(x)$ into simple components (for instance, uniform 
distributions on intervals) and a last residual term with a small weight, the following 
approximation applies: We can use a trapezoidal approximation of $f$ on intervals 
$[a_i, b_i]$, the weight $p_i$ being of the order of $\int_{a_i}^{b_i} f(x)\,dx$. Devroye (1985) details the 
applications of this method in the case where $f$ is a polynomial on $[0, 1]$. 



Monte Carlo Integration 



Cadfael had heard the words without hearing them and enlightenment fell 
on him so dazzlingly that he stumbled on the threshold. 

— Ellis Peters, The Heretic’s Apprentice 



While Chapter 2 focussed on developing techniques to produce random vari- 
ables by computer, this chapter introduces the central concept of Monte Carlo 
methods, that is, taking advantage of the availability of computer generated 
random variables to approximate univariate and multidimensional integrals. 
In Section 3.2, we introduce the basic notion of Monte Carlo approximations 
as a byproduct of the Law of Large Numbers, while Section 3.3 highlights the 
universality of the approach by stressing the versatility of the representation 
of an integral as an expectation. 



3.1 Introduction 

Two major classes of numerical problems that arise in statistical inference are 
optimization problems and integration problems. (An associated problem, that 
of implicit equations, can often be reformulated as an optimization problem.) 
Although optimization is generally associated with the likelihood approach, 
and integration with the Bayesian approach, these are not strict classifica- 
tions, as shown by Examples 1.5 and 1.15, and Examples 3.1, 3.2 and 3.3, 
respectively. 

Examples 1.1-1.15 have also shown that it is not always possible to derive 
explicit probabilistic models and that it is even less possible to analytically 
compute the estimators associated with a given paradigm (maximum likelihood, 
Bayes, method of moments, etc.). Moreover, other statistical methods, 
such as bootstrap methods (see Note 1.6.2), although unrelated to the Bayesian 
approach, may involve the integration of the empirical cdf. Similarly, 
alternatives to standard likelihood, such as marginal likelihood, may require the 
integration of the nuisance parameters (Barndorff-Nielsen and Cox 1994). 

Although many calculations in Bayesian inference require integration, this 
is not always the case. Integration is clearly needed when the Bayes estimators 
are posterior expectations (see Section 1.3 and Problem 1.22); however, 
Bayes estimators are not always posterior expectations. In general, the Bayes 
estimate under the loss function $L(\theta, \delta)$ and the prior $\pi$ is the solution of the 
minimization program 

(3.1) $$\min_\delta \int_\Theta L(\theta, \delta)\, \pi(\theta)\, f(x|\theta)\, d\theta .$$ 

Only when the loss function is the quadratic function $\|\theta - \delta\|^2$ will the Bayes 
estimator be a posterior expectation. While some other loss functions lead to 
general solutions $\delta^\pi(x)$ of (3.1) in terms of $\pi(\theta|x)$ (see, for instance, Robert 
1996b, 2001 for the case of intrinsic losses), a specific setup where the loss 
function is constructed by the decision-maker almost always precludes 
analytical integration of (3.1). This necessitates an approximate solution of (3.1) 
either by numerical methods or by simulation. 

Thus, whatever the type of statistical inference, we are led to consider 
numerical solutions. The previous chapter has illustrated a number of methods 
for the generation of random variables with any given distribution and, hence, 
provides a basis for the construction of solutions to our statistical problems. 
Thus, just as the search for a stationary state in a dynamical system in physics 
or in economics can require one or several simulations of successive states of 
the system, statistical inference on complex models will often require the use of 
simulation techniques. (See, for instance, Bauwens 1984, Bauwens and Richard 
1985 and Gourieroux and Monfort 1996 for illustrations in econometrics.) 
We now look at a number of examples illustrating these situations before 
embarking on a description of simulation-based integration methods. 

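Example 3.1. $L_1$ loss. For $\theta \in \mathbb{R}$ and $L(\theta, \delta) = |\theta - \delta|$, the Bayes estimator 
associated with $\pi$ is the posterior median of $\pi(\theta|x)$, $\delta^\pi(x)$, which is the solution 
to the equation 

(3.2) $$\int_{\theta \le \delta^\pi(x)} \pi(\theta)\, f(x|\theta)\, d\theta = \int_{\theta \ge \delta^\pi(x)} \pi(\theta)\, f(x|\theta)\, d\theta .$$ 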

In the setup of Example 1.7, that is, when $\lambda = \|\theta\|^2$ and $X \sim \mathcal{N}_p(\theta, I_p)$, this 
equation is quite complex, since, when using the reference prior of Example 
1.12, 

$$\pi(\lambda|x) \propto \lambda^{p-1/2} \int e^{-\|x - \theta\|^2/2} \prod_{i=1}^{p-2} \sin(\varphi_i)^{p-i-1}\, d\varphi_1 \cdots d\varphi_{p-1},$$ 

where $\lambda, \varphi_1, \ldots, \varphi_{p-1}$ are the polar coordinates of $\theta$, that is, $\theta_1 = \lambda \cos(\varphi_1)$, 
$\theta_2 = \lambda \sin(\varphi_1) \cos(\varphi_2)$, .... || 

Example 3.2. Piecewise linear and quadratic loss functions. Consider 
a loss function which is piecewise quadratic, 

(3.3) $$L(\theta, \delta) = \omega_i (\theta - \delta)^2 \quad\text{when } \theta - \delta \in [a_i, a_{i+1}),\ \omega_i > 0.$$ 

Differentiating the posterior expectation of (3.3) shows that the associated Bayes 
estimator satisfies 

$$\sum_i \omega_i \int_{a_i}^{a_{i+1}} (\theta - \delta^\pi(x))\, \pi(\theta|x)\, d\theta = 0,$$ 

that is, 

$$\delta^\pi(x) = \frac{\sum_i \omega_i \int_{a_i}^{a_{i+1}} \theta\, \pi(\theta) f(x|\theta)\, d\theta}{\sum_i \omega_i \int_{a_i}^{a_{i+1}} \pi(\theta) f(x|\theta)\, d\theta} .$$ 

Although formally explicit, the computation of $\delta^\pi(x)$ requires the computation 
of the posterior means restricted to the intervals $[a_i, a_{i+1})$ and of the posterior 
probabilities of these intervals. 

Similarly, consider a piecewise linear loss function, 

$$L(\theta, \delta) = \omega_i |\theta - \delta| \quad\text{if } \theta - \delta \in [a_i, a_{i+1}),$$ 

or Huber's (1972) loss function, 

$$L(\theta, \delta) = \begin{cases} \rho(\theta - \delta)^2 & \text{if } |\theta - \delta| < c, \\ 2\rho c\{|\theta - \delta| - c/2\} & \text{otherwise,} \end{cases}$$ 

where $\rho$ and $c$ are specified constants. Although a specific type of prior 
distribution leads to explicit formulas, most priors result only in integral forms 
of $\delta^\pi$. Some of these may be quite complex. || 
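When a sample from the posterior distribution is available, the minimization (3.1) can be approximated for losses such as those of Examples 3.1 and 3.2 by replacing the posterior expected loss with an average over the sample. The Python sketch below does so on a grid of candidate decisions. Everything specific in it is an assumption made for illustration: the posterior is taken to be a Gamma(3, 1) distribution, Huber's constants are set to 1, and the names `bayes_estimate`, `abs_loss` and `huber_loss` are ours.

```python
import random
import statistics

def abs_loss(theta, delta):
    return abs(theta - delta)

def huber_loss(theta, delta, rho=1.0, c=1.0):
    d = abs(theta - delta)
    return rho * d * d if d < c else 2 * rho * c * (d - c / 2)

def bayes_estimate(loss, draws, candidates):
    """Candidate decision minimizing the average loss over the posterior draws."""
    def average_loss(delta):
        return sum(loss(t, delta) for t in draws) / len(draws)
    return min(candidates, key=average_loss)

rng = random.Random(0)
draws = [rng.gammavariate(3.0, 1.0) for _ in range(1001)]   # stand-in posterior sample
grid = [0.02 * i for i in range(401)]                       # candidate decisions in [0, 8]

d_abs = bayes_estimate(abs_loss, draws, grid)
d_huber = bayes_estimate(huber_loss, draws, grid)
print(d_abs, statistics.median(draws), d_huber)
```

Under the absolute error loss, the grid minimizer agrees with the sample median up to the grid step, in line with Example 3.1. The same function handles Huber's loss without any change, at the price of one pass over the sample per candidate.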



Inference based on classical decision theory evaluates the performance of 
estimators (maximum likelihood estimator, best unbiased estimator, moment 
estimator, etc.) through the loss imposed by the decision-maker or by the 
setting. Estimators are then compared through their expected losses, also 
called risks. In most cases, it is impossible to obtain an analytical evaluation 
of the risk of a given estimator, or even to establish that a new estimator 
(uniformly) dominates a standard estimator. 

It may seem that the topic of James-Stein estimation is an exception to 
this observation, given the abundant literature on the topic. In fact, for some 
families of distributions (such as exponential or spherically symmetric) and 
some types of loss functions (such as quadratic or concave), it is possible to 
analytically establish domination results over the maximum likelihood esti- 
mator or unbiased estimators (see Lehmann and Casella 1998, Chapter 5 or 
Robert 2001, Chapter 2). Nonetheless, in these situations, estimators such as 
empirical Bayes estimators, which are quite attractive in practice, will rarely 
allow for analytic expressions. This makes their evaluation under a given loss 
problematic. 

Given a sampling distribution $f(x|\theta)$ and a conjugate prior distribution 
$\pi(\theta|\lambda, \mu)$, the empirical Bayes method estimates the hyperparameters $\lambda$ and $\mu$ 
from the marginal distribution 

$$m(x|\lambda, \mu) = \int f(x|\theta)\, \pi(\theta|\lambda, \mu)\, d\theta$$ 

by maximum likelihood. The estimated distribution $\pi(\theta|\hat\lambda, \hat\mu)$ is often used as 
in a standard Bayesian approach (that is, without taking into account the 
effect of the substitution) to derive a point estimator. See Searle et al. (1992, 
Chapter 9) or Carlin and Louis (1996) for a more detailed discussion on this 
approach. (We note that this approach is sometimes called parametric empirical 
Bayes, as opposed to the nonparametric empirical Bayes approach developed 
by Herbert Robbins. See Robbins 1964, 1983 or Maritz and Lwin 1989 
for details.) The following example illustrates some difficulties encountered in 
evaluating empirical Bayes estimators (see also Example 4.12). 



Example 3.3. Empirical Bayes estimator. Let $X$ have the distribution 
$X \sim \mathcal{N}_p(\theta, I_p)$ and let $\theta \sim \mathcal{N}_p(\mu, \lambda I_p)$, the corresponding conjugate prior. The 
hyperparameter $\mu$ is often specified, and here we take $\mu = 0$. In the empirical 
Bayes approach, the scale hyperparameter $\lambda$ is replaced by the maximum 
likelihood estimator, $\hat\lambda$, based on the marginal distribution $X \sim \mathcal{N}_p(0, (\lambda + 1) I_p)$. 
This leads to the maximum likelihood estimator $\hat\lambda = (\|x\|^2/p - 1)^+$. 
Since the posterior distribution of $\theta$ given $\lambda$ is $\mathcal{N}_p(\lambda x/(\lambda + 1), \lambda I_p/(\lambda + 1))$, 
empirical Bayes inference may be based on the pseudo-posterior 
$\mathcal{N}_p(\hat\lambda x/(\hat\lambda + 1), \hat\lambda I_p/(\hat\lambda + 1))$. If, for instance, $\|\theta\|^2$ is the quantity of interest, and if it is 
evaluated under a quadratic loss, the empirical Bayes estimator is 

$$\delta^{eb}(x) = \left(\frac{\hat\lambda}{\hat\lambda + 1}\right)^2 \|x\|^2 + \frac{\hat\lambda p}{\hat\lambda + 1} = \left[\left(1 - \frac{p}{\|x\|^2}\right)^+\right]^2 \|x\|^2 + p \left(1 - \frac{p}{\|x\|^2}\right)^+ = (\|x\|^2 - p)^+ .$$ 

This estimator dominates both the best unbiased estimator, $\|x\|^2 - p$, and 
the maximum likelihood estimator based on $\|x\|^2 \sim \chi^2_p(\|\theta\|^2)$ (see Saxena and 
Alam 1982 and Example 1.8). However, since the proof of this second 
domination result is quite involved, one might first check for domination through 
a simulation experiment that evaluates the risk function, 

$$R(\delta, \theta) = \mathbb{E}_\theta\left[(\|\theta\|^2 - \delta)^2\right],$$ 

for the three estimators. This quadratic risk is often normalized by $1/(2\|\theta\|^2 
+ p)$ (which does not affect domination results but ensures the existence of 
a minimax estimator; see Robert 2001). Problem 3.8 contains a complete 
solution to the evaluation of risk. || 
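A simulation experiment of this kind can be sketched in a few lines of Python. The version below estimates the quadratic risk of the unbiased estimator and of its positive part for one value of $\theta$; the maximum likelihood estimator is left out because it requires a numerical maximization of the noncentral chi-squared likelihood. The function name `simulate_risks`, the number of replications, and the value of $\theta$ are choices made for this illustration.

```python
import random

def simulate_risks(theta, n_rep=20000, seed=1):
    """Monte Carlo estimates of E[(||theta||^2 - delta(X))^2] for
    delta(x) = ||x||^2 - p (unbiased) and delta(x) = (||x||^2 - p)^+."""
    rng = random.Random(seed)
    p = len(theta)
    target = sum(t * t for t in theta)
    loss_unbiased = loss_positive_part = 0.0
    for _ in range(n_rep):
        x = [rng.gauss(t, 1.0) for t in theta]      # X ~ N_p(theta, I_p)
        unbiased = sum(v * v for v in x) - p
        positive_part = max(unbiased, 0.0)
        loss_unbiased += (target - unbiased) ** 2
        loss_positive_part += (target - positive_part) ** 2
    return loss_unbiased / n_rep, loss_positive_part / n_rep

r_unbiased, r_positive_part = simulate_risks([0.5] * 5)
print(r_unbiased, r_positive_part)
```

For this $\theta$ ($p = 5$, $\|\theta\|^2 = 1.25$), the risk of the unbiased estimator is the variance of a noncentral chi-squared variable, $2p + 4\|\theta\|^2 = 15$, which gives a check on the simulation. Repeating the experiment over a range of values of $\|\theta\|^2$ produces the risk functions to be compared.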

A general solution to the different computational problems contained in 
the previous examples and in those of Section 1.1 is to use simulation, of 
either the true or approximate distributions to calculate the quantities of 
interest. In the setup of Decision Theory, whether it is classical or Bayesian, 
this solution is natural, since risks and Bayes estimators involve integrals 
with respect to probability distributions. We will see in Chapter 5 why this 
solution also applies in the case of maximum likelihood estimation. Note that 
the possibility of producing an almost infinite number of random variables 
distributed according to a given distribution gives us access to the use of 
frequentist and asymptotic results much more easily than in usual inferential 
settings (see Serfling 1980 or Lehmann and Casella 1998, Chapter 6) where the 
sample size is most often fixed. One can, therefore, apply probabilistic results 
such as the Law of Large Numbers or the Central Limit Theorem, since they 
allow for an assessment of the convergence of simulation methods (which is 
equivalent to the deterministic bounds used by numerical approaches.) 



3.2 Classical Monte Carlo Integration 

Before applying our simulation techniques to more practical problems, we 
first need to develop their properties in some detail. This is more easily ac- 
complished by looking at the generic problem of evaluating the integral 

(3.4) $$\mathbb{E}_f[h(X)] = \int_{\mathcal{X}} h(x)\, f(x)\, dx .$$ 

Based on previous developments, it is natural to propose using a sample 
$(X_1, \ldots, X_m)$ generated from the density $f$ to approximate (3.4) by the 
empirical average¹ 

$$\overline{h}_m = \frac{1}{m} \sum_{j=1}^m h(x_j),$$ 

since $\overline{h}_m$ converges almost surely to $\mathbb{E}_f[h(X)]$ by the Strong Law of Large 
Numbers. Moreover, when $h^2$ has a finite expectation under $f$, the speed of 
convergence of $\overline{h}_m$ can be assessed since the variance 

$$\mathrm{var}(\overline{h}_m) = \frac{1}{m} \int_{\mathcal{X}} \left(h(x) - \mathbb{E}_f[h(X)]\right)^2 f(x)\, dx$$ 

can also be estimated from the sample $(X_1, \ldots, X_m)$ through 

$$v_m = \frac{1}{m^2} \sum_{j=1}^m \left[h(x_j) - \overline{h}_m\right]^2 .$$ 

For $m$ large, 

$$\frac{\overline{h}_m - \mathbb{E}_f[h(X)]}{\sqrt{v_m}}$$ 

is therefore approximately distributed as a $\mathcal{N}(0, 1)$ variable, and this leads 
to the construction of a convergence test and of confidence bounds on the 
approximation of $\mathbb{E}_f[h(X)]$. 

¹ This approach is often referred to as the Monte Carlo method, following Metropolis 
and Ulam (1949). We will meet Nicolas Metropolis (1915-1999) again in Chapters 
5 and 7, with the simulated annealing and MCMC methods. 

Example 3.4. A first Monte Carlo integration. Recall the function 
(1.26) that we saw in Example 1.17, $h(x) = [\cos(50x) + \sin(20x)]^2$. As a 
first example, we look at integrating this function, which is shown in Figure 
3.1 (left). Although it is possible to integrate this function analytically, it is 
a good first test case. To calculate the integral, we generate $U_1, U_2, \ldots, U_n$ 
iid $\mathcal{U}(0, 1)$ random variables, and approximate $\int h(x)\,dx$ with $\sum h(U_i)/n$. The 
center panel in Figure 3.1 shows a histogram of the values of $h(U_i)$, and the 
last panel shows the running means and standard errors. It is clear that the 
Monte Carlo average is converging, with value of 0.963 after 10,000 iterations. 
This compares favorably with the exact value of 0.965. (See Example 4.1 for 
a more formal monitoring of convergence.) || 
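The experiment of Example 3.4 can be reproduced along the following lines. This Python sketch is only one possible implementation: the uniform generator is the one of the standard library rather than Kiss, and the seed and the function name `monte_carlo` are arbitrary. The standard error is the square root of the variance estimate described at the beginning of this section.

```python
import math
import random

def h(x):
    return (math.cos(50 * x) + math.sin(20 * x)) ** 2

def monte_carlo(h, n, rng):
    """Empirical average of h(U_1), ..., h(U_n) and its estimated standard error."""
    values = [h(rng.random()) for _ in range(n)]
    mean = sum(values) / n
    variance = sum((y - mean) ** 2 for y in values) / n**2   # estimates var of the average
    return mean, math.sqrt(variance)

estimate, se = monte_carlo(h, 10_000, random.Random(2004))
print(f"{estimate:.3f} +/- {2 * se:.3f}")
```

The interval printed is the estimate plus or minus two standard errors, which should cover the exact value 0.965 in about 95% of the runs when the normal approximation holds.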




[Figure 3.1: three panels titled "Function", "Generated Values of Function", and "Mean and Standard Errors".] 

Fig. 3.1. Calculation of the integral of the function (1.26): (left) function (1.26), 
(center) histogram of 10,000 values $h(U_i)$, simulated using a uniform generation, 
and (right) mean ± one standard error. 

  n \ t    0.0      0.67     0.84     1.28     1.65     2.32     2.58     3.09     3.72 
  10^2     0.485    0.74     0.77     0.9      0.945    0.985    0.995    1        1 
  10^3     0.4925   0.7455   0.801    0.902    0.9425   0.9885   0.9955   0.9985   1 
  10^4     0.4962   0.7425   0.7941   0.9      0.9498   0.9896   0.995    0.999    0.9999 
  10^5     0.4995   0.7489   0.7993   0.9003   0.9498   0.9898   0.995    0.9989   0.9999 
  10^6     0.5001   0.7497   0.8      0.9002   0.9502   0.99     0.995    0.999    0.9999 
  10^7     0.5002   0.7499   0.8      0.9001   0.9501   0.99     0.995    0.999    0.9999 
  10^8     0.5      0.75     0.8      0.9      0.95     0.99     0.995    0.999    0.9999 

Table 3.1. Evaluation of some normal quantiles by a regular Monte Carlo 
experiment based on n replications of a normal generation. The last line gives the exact 
values. 



The approach followed in the above example can be successfully utilized 
in many cases, even though it is often possible to achieve greater efficiency 
through numerical methods (Riemann quadrature, Simpson method, etc.) in 
dimension 1 or 2. The scope of application of this Monte Carlo integration 
method is obviously not limited to the Bayesian paradigm since, similar to 
Example 3.3, the performances of complex procedures can be measured in 
any setting where the distributions involved in the model can be simulated. 
For instance, we can use Monte Carlo sums to calculate a normal cumulative 
distribution function (even though the normal cdf can now be found in all 
software and many pocket calculators). 

Example 3.5. Normal cdf. Since the normal cdf cannot be written in an 
explicit form, a possible way to construct normal distribution tables is to use 
simulation. Consider the generation of a sample of size $n$, $(x_1, \ldots, x_n)$, based 
on the Box-Muller algorithm [A.4] of Example 2.2.2. 

The approximation of 

$$\Phi(t) = \int_{-\infty}^t \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy$$ 

by the Monte Carlo method is thus 

$$\hat\Phi(t) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{x_i \le t},$$ 

with (exact) variance $\Phi(t)(1 - \Phi(t))/n$ (as the variables $\mathbb{I}_{x_i \le t}$ are independent 
Bernoulli with success probability $\Phi(t)$). For values of $t$ around $t = 0$, the 
variance is thus approximately $1/4n$, and to achieve a precision of four decimals, 
the approximation requires on average $n = (\sqrt{2}\, 10^4)^2$ simulations, that is, 
200 million iterations. Table 3.1 gives the evolution of this approximation for 
several values of $t$ and shows an accurate evaluation for 100 million iterations. 
Note that greater (absolute) accuracy is achieved in the tails and that more 
efficient simulation methods could be used, as in Example 3.8 below. || 
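A small-scale version of this experiment is sketched below in Python. The Box-Muller transform is written out explicitly; since it produces normal variables in pairs, the sketch uses both members of each pair, which is a choice of ours. The exact value is obtained from the error function of the standard library for comparison, and the seed and function names are arbitrary.

```python
import math
import random

def box_muller(rng):
    """One pair of independent N(0,1) variables built from two uniforms."""
    u1 = 1.0 - rng.random()            # in (0, 1], so that log(u1) is finite
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)

def normal_cdf_mc(t, n_pairs, rng):
    """Proportion of the 2 * n_pairs simulated normals below t, with its standard error."""
    hits = 0
    for _ in range(n_pairs):
        x1, x2 = box_muller(rng)
        hits += (x1 <= t) + (x2 <= t)
    n = 2 * n_pairs
    p_hat = hits / n
    return p_hat, math.sqrt(p_hat * (1 - p_hat) / n)

def normal_cdf_exact(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

rng = random.Random(3)
for t in (0.0, 1.28, 2.32):
    p_hat, se = normal_cdf_mc(t, 10_000, rng)
    print(t, round(p_hat, 4), round(se, 4), round(normal_cdf_exact(t), 4))
```

At $t = 0$ the standard error is $1/(2\sqrt{n})$, that is, about $3.5 \times 10^{-5}$ for $n = 2 \times 10^8$, which is the order of magnitude needed for four reliable decimals. The standard errors printed for the larger values of $t$ are smaller, in agreement with the remark on the tails.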
We mentioned in Section 3.1 the potential of this approach in evaluating 
estimators based on a decision-theoretic formulation. The same applies for 
testing, when the level of significance of a test, and its power function, cannot 
be easily computed, and simulation thus can provide a useful improvement 
over asymptotic approximations when explicit computations are impossible. 
The following example illustrates this somewhat different application of Monte 
Carlo integration. 

Many tests are based on an asymptotic normality assumption as, for 
instance, the likelihood ratio test. Given a null hypothesis $H_0$ corresponding to 
$r$ independent constraints on the parameter $\theta \in \mathbb{R}^k$, denote by $\hat\theta$ and $\hat\theta^0$ the 
unconstrained and constrained (under $H_0$) maximum likelihood estimators of 
$\theta$, respectively. The likelihood ratio $\ell(\hat\theta|x)/\ell(\hat\theta^0|x)$ then satisfies 

(3.5) $$2 \log[\ell(\hat\theta|x)/\ell(\hat\theta^0|x)] = 2 \{\log \ell(\hat\theta|x) - \log \ell(\hat\theta^0|x)\} \sim \chi^2_r,$$ 

when the number of observations goes to infinity (see Lehmann 1986, Section 
8.8, or Gourieroux and Monfort 1996). However, the $\chi^2_r$ approximation only 
holds asymptotically and, further, this convergence only holds under regularity 
constraints on the likelihood function (see Lehmann and Casella 1998, Chapter 
6, for a full development); hence, the asymptotics may not even apply. 

Example 3.6. Contingency Tables. Table 3.2 gives the results of a study 
comparing radiation therapy with surgery in treating cancer of the larynx. 





            Cancer        Cancer not
            Controlled    Controlled    Total
Surgery         21             2          23
Radiation       15             3          18
Total           36             5          41

Table 3.2. Comparison of cancer treatment success from surgery or radiation only
(Source: Agresti 1996, p. 50).



Typical sampling models for contingency tables may condition on both 
margins, one margin, or only the table total, and often the choice is based 
on philosophical reasons (see, for example, Agresti 1992). In this case we may 
argue for conditioning on the number of patients in each group, or we may 
just condition on the table total (there is little argument for conditioning on 
both margins). Happily, in many cases the resulting statistical conclusion is 
not dependent on this choice but, for definiteness, we will choose to condition 
only on the table total, n = 41. 

Under this model, each observation $X_i$ comes from a multinomial
distribution with four cells and cell probabilities $p = (p_{11}, p_{12}, p_{21}, p_{22})$, with
$\sum_{ij} p_{ij} = 1$, that is,

$$X_i \sim \mathcal{M}_4(1;\, p_{11}, p_{12}, p_{21}, p_{22}), \qquad i = 1, \ldots, n.$$

If we denote by $y_{ij}$ the number of $x_i$ that are in cell $ij$, the likelihood function
can be written

$$\ell(p|y) \propto p_{11}^{y_{11}}\, p_{12}^{y_{12}}\, p_{21}^{y_{21}}\, p_{22}^{y_{22}}.$$

The null hypothesis to be tested is one of independence, which is to say that 
the treatment has no bearing on the control of cancer. To translate this into 
a parameter statement, we note that the full parameter space corresponding 
to Table 3.2 is 



    p_11     p_12       p_1
    p_21     p_22       1 - p_1
    p_2      1 - p_2    1

and the null hypothesis of independence is $H_0 : p_{11} = p_1 p_2$. The likelihood
ratio statistic for testing this hypothesis is

$$\lambda(y) = \frac{\max_{p:\, p_{11} = p_1 p_2} \ell(p|y)}{\max_{p} \ell(p|y)}.$$

It is straightforward to show that the numerator maximum is attained at
$\hat p_1 = (y_{11} + y_{12})/n$ and $\hat p_2 = (y_{11} + y_{21})/n$, and the denominator maximum at
$\hat p_{ij} = y_{ij}/n$.

As mentioned above, under $H_0$, $-2 \log \lambda$ is asymptotically distributed as
$\chi_1^2$. However, with only 41 observations, the asymptotics do not necessarily
apply. One alternative is to use an exact permutation test (Mehta et al. 2000),
and another alternative is to devise a Monte Carlo experiment to simulate the
null distribution of $-2 \log \lambda$ or equivalently of $\lambda$ in order to obtain a cutoff
point for a hypothesis test. If we denote this null distribution by $f_0(\lambda)$, and
we are interested in an $\alpha$ level test, we specify $\alpha$ and solve for $\lambda_\alpha$ the integral
equation

$$\int_0^{\lambda_\alpha} f_0(\lambda)\, d\lambda = 1 - \alpha. \tag{3.6}$$

The standard Monte Carlo approach to this problem is to generate random
variables $\lambda^t \sim f_0(\lambda)$, $t = 1, \ldots, M$, then order the sample
$\lambda^{(1)} \le \lambda^{(2)} \le \cdots \le \lambda^{(M)}$ and finally calculate the empirical $1 - \alpha$ percentile
$\lambda^{(\lfloor (1-\alpha) M \rfloor)}$. We then have

$$\lim_{M \to \infty} \lambda^{(\lfloor (1-\alpha) M \rfloor)} = \lambda_\alpha.$$

(Note that this is a slightly unusual Monte Carlo experiment in that $\alpha$ is
known and $\lambda_\alpha$ is not, but it is nonetheless based on the same convergence of
empirical measures.)

    Percentile    Monte Carlo    chi^2_1
       .10           2.84          3.87
       .05           3.93          4.68
       .01           6.72          6.36

Table 3.3. Cutoff points for the null distribution $f_0$ compared to $\chi_1^2$.



To run the Monte Carlo experiment, we need to generate values from $f_0(\lambda)$.
Since this distribution is not completely specified (the parameters $p_1$ and $p_2$
can be any value in $(0, 1)$), to generate a value from $f_0(\lambda)$ we generate

$$p_i \sim \mathcal{U}(0,1), \quad i = 1, 2, \qquad
X \sim \mathcal{M}_4\big(p_1 p_2,\; p_1(1-p_2),\; (1-p_1)p_2,\; (1-p_1)(1-p_2)\big), \tag{3.7}$$

and calculate $\lambda(x)$. The results, given in Table 3.3 and Figure 3.2, show that
the Monte Carlo null distribution has a slightly different shape than the $\chi_1^2$
distribution, being slightly more concentrated around 0 but with longer tails.

The analysis of the given data is somewhat anticlimactic, as the observed
value of $-2 \log \lambda(y)$ is .594, which according to any calibration gives overwhelming
support to $H_0$. ||
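The experiment (3.6)-(3.7) can be sketched as follows. This is an illustrative reading of the example rather than the authors' code: it assumes the convention $0 \log 0 = 0$ for empty cells, uses the order statistic at index $\lfloor (1-\alpha) M \rfloor$ as the empirical percentile, and relies only on Python's standard library; the function names are hypothetical.

```python
import math
import random


def neg2_log_lambda(y):
    """-2 log(lambda) for a 2x2 table y = (y11, y12, y21, y22); 0*log(0) is taken as 0."""
    n = sum(y)
    row = (y[0] + y[1], y[2] + y[3])
    col = (y[0] + y[2], y[1] + y[3])
    stat = 0.0
    for idx, count in enumerate(y):
        if count > 0:
            # Fitted cell count under independence: (row total)(column total)/n
            expected = row[idx // 2] * col[idx % 2] / n
            stat += 2.0 * count * math.log(count / expected)
    return stat


def simulate_null(n, n_sim, rng):
    """Sorted draws of -2 log(lambda) under (3.7), conditioning only on the total n."""
    stats = []
    for _ in range(n_sim):
        p1, p2 = rng.random(), rng.random()
        probs = (p1 * p2, p1 * (1 - p2), (1 - p1) * p2, (1 - p1) * (1 - p2))
        cells = rng.choices(range(4), weights=probs, k=n)
        table = [cells.count(c) for c in range(4)]
        stats.append(neg2_log_lambda(table))
    stats.sort()
    return stats


def empirical_cutoff(sorted_stats, alpha):
    """Empirical (1 - alpha) percentile of a sorted sample."""
    index = min(int((1 - alpha) * len(sorted_stats)), len(sorted_stats) - 1)
    return sorted_stats[index]


rng = random.Random(0)  # arbitrary seed
null_sample = simulate_null(n=41, n_sim=10_000, rng=rng)
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha={alpha:.2f}  Monte Carlo cutoff={empirical_cutoff(null_sample, alpha):.2f}")
print(f"observed statistic: {neg2_log_lambda((21, 2, 15, 3)):.3f}")
```

The observed statistic is computed from the counts of Table 3.2; the simulated cutoffs vary with the seed and the number of replications.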



Fig. 3.2. For Example 3.6, histogram of null distribution and approximating
$\chi_1^2$ density (left panel). The right panel gives the running empirical percentiles
(.90, .95, .99), from bottom to top. Notice the higher variability in the higher
percentiles (10,000 simulations).



Example 3.7. Testing the number of components. A situation where
the standard $\chi_r^2$ regularity conditions do not apply for the likelihood ratio test
is that of the normal mixture (see Example 1.10)

$$p\, \mathcal{N}(\mu, 1) + (1 - p)\, \mathcal{N}(\mu + \theta, 1),$$

where the constraint $\theta > 0$ ensures identifiability. A test on the existence of
a mixture cannot be easily represented in a hypothesis test since $H_0 : p = 0$
effectively eliminates the mixture and results in the identifiability problem
related with $\mathcal{N}(\mu + \theta, 1)$. (The inability to estimate the nuisance parameter
$p$ under $H_0$ results in the likelihood not satisfying the necessary regularity
conditions; see Davies 1977. However, see Lehmann and Casella 1998, Section
6.6 for mixtures where it is possible to construct efficient estimators.)




Fig. 3.3. Empirical cdf of a sample of log-likelihood ratios for the test of presence
of a Gaussian mixture (solid lines) and comparison with the cdf of a $\chi_2^2$ distribution
(dotted lines, below) and with the cdf of a .5 - .5 mixture of a $\chi_2^2$ distribution and
of a Dirac mass at 0 (dotted lines, above) (based on 1000 simulations of a normal
$\mathcal{N}(0, 1)$ sample of size 100).



A slightly different formulation of the problem will allow a solution,
however. If the identifiability constraint is taken to be $p > 1/2$ instead of $\theta > 0$,
then $H_0$ can be represented as

$$H_0 :\; p = 1 \quad \text{or} \quad \theta = 0.$$

We therefore want to determine the limiting distribution of (3.5) under this
hypothesis and under a local alternative. Figure 3.3 represents the empirical
cdf of $2\{\log \ell(\hat p, \hat\mu, \hat\theta|x) - \log \ell(\hat\mu^0|x)\}$ and compares it with the $\chi_2^2$ cdf,
where $\hat p$, $\hat\mu$, $\hat\theta$, and $\hat\mu^0$ are the respective MLEs for 1000 simulations of a normal
$\mathcal{N}(0, 1)$ sample of size 100. The poor agreement between the asymptotic
approximation and the empirical cdf is quite obvious. Figure 3.3 also shows how
the $\chi_2^2$ approximation is improved if the limit (3.5) is replaced by an equally
weighted mixture of a Dirac mass at 0 and a $\chi_2^2$ distribution.

Note that the resulting sample of the log-likelihood ratios can also be used
for inferential purposes, for instance to derive an exact test via the estimation
of the quantiles of the distribution of (3.5) under $H_0$ or to evaluate the power
of a standard test. ||

It may seem that the method proposed above is sufficient to approximate
integrals like (3.4) in a controlled way. However, while the straightforward
Monte Carlo method indeed provides good approximations of (3.4) in most
regular cases, there exist more efficient alternatives which not only avoid a
direct simulation from $f$ but also can be used repeatedly for several integrals
of the form (3.4). The repeated use can be for either a family of functions $h$
or a family of densities $f$. In particular, the usefulness of this flexibility
is quite evident in Bayesian analyses of robustness, of sensitivity (see Berger
1990, 1994), or for the computation of power functions of specific tests (see
Lehmann 1986, or Gourieroux and Monfort 1996).



3.3 Importance Sampling 



3.3.1 Principles 



The method we now study is called importance sampling because it is based
on so-called importance functions, and although it would be more accurate to
call it "weighted sampling," we will follow common usage. We start this
section with a somewhat unusual example, borrowed from Ripley (1987), which
shows that it may actually pay to generate from a distribution other than the
distribution $f$ of interest or, in other words, to modify the representation of
an integral as an expectation against a given density. (See Note 3.6.1 for a
global approach to the approximation of tail probabilities by large deviation
techniques.)



Example 3.8. Cauchy tail probability. Suppose that the quantity of
interest is the probability, $p$, that a Cauchy $\mathcal{C}(0, 1)$ variable is larger than 2,
that is,

$$p = \int_2^{+\infty} \frac{1}{\pi(1 + x^2)}\, dx.$$

When $p$ is evaluated through the empirical average

$$\hat p_1 = \frac{1}{m} \sum_{j=1}^{m} \mathbb{I}_{X_j > 2}$$

of an iid sample $X_1, \ldots, X_m \sim \mathcal{C}(0, 1)$, the variance of this estimator is
$p(1 - p)/m$ (equal to $0.127/m$ since $p = 0.15$). This variance can be reduced
by taking into account the symmetric nature of $\mathcal{C}(0, 1)$, since the average

$$\hat p_2 = \frac{1}{2m} \sum_{j=1}^{m} \mathbb{I}_{|X_j| > 2}$$

has variance $p(1 - 2p)/2m$ equal to $0.052/m$.

The (relative) inefficiency of these methods is due to the generation of
values outside the domain of interest, $[2, +\infty)$, which are, in some sense,
irrelevant for the approximation of $p$. If $p$ is written as

$$p = \frac{1}{2} - \int_0^2 \frac{1}{\pi(1 + x^2)}\, dx,$$

the integral above can be considered to be the expectation of $h(X) = 2/\pi(1 + X^2)$,
where $X \sim \mathcal{U}_{[0,2]}$. An alternative method of evaluation for $p$ is therefore

$$\hat p_3 = \frac{1}{2} - \frac{1}{m} \sum_{j=1}^{m} h(U_j)$$

for $U_j \sim \mathcal{U}_{[0,2]}$. The variance of $\hat p_3$ is $(\mathbb{E}[h^2] - \mathbb{E}[h]^2)/m$ and an integration by
parts shows that it is equal to $0.0285/m$. Moreover, since $p$ can be written as

$$p = \int_0^{1/2} \frac{y^{-2}}{\pi(1 + y^{-2})}\, dy,$$

this integral can also be seen as the expectation of $\frac{1}{4} h(Y) = 1/2\pi(1 + Y^2)$
against the uniform distribution on $[0, 1/2]$ and another evaluation of $p$ is

$$\hat p_4 = \frac{1}{4m} \sum_{j=1}^{m} h(Y_j)$$

when $Y_j \sim \mathcal{U}_{[0,1/2]}$. The same integration by parts shows that the variance of
$\hat p_4$ is then $0.95\, 10^{-4}/m$.

Compared with $\hat p_1$, the reduction in variance brought by $\hat p_4$ is of order
$10^{-3}$, which implies, in particular, that this evaluation requires $\sqrt{1000} \approx 32$
times fewer simulations than $\hat p_1$ to achieve the same precision. ||
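The four estimators of Example 3.8 can be compared numerically with a short sketch. It assumes Cauchy variates are produced by inverting the cdf, $X = \tan(\pi(U - 1/2))$, and uses the closed form $p = 1/2 - \arctan(2)/\pi$ as the reference value; the helper names are hypothetical.

```python
import math
import random


def cauchy_draw(rng):
    """C(0,1) variate by inversion of the Cauchy cdf."""
    return math.tan(math.pi * (rng.random() - 0.5))


def h(x):
    """h(x) = 2 / (pi (1 + x^2)), as in Example 3.8."""
    return 2.0 / (math.pi * (1.0 + x * x))


def cauchy_tail_estimates(m, rng):
    """Return (p1, p2, p3, p4), the four estimators of P(X > 2) for X ~ C(0,1)."""
    xs = [cauchy_draw(rng) for _ in range(m)]
    p1 = sum(x > 2 for x in xs) / m                       # raw indicator average
    p2 = sum(abs(x) > 2 for x in xs) / (2 * m)            # uses symmetry
    p3 = 0.5 - sum(h(rng.uniform(0.0, 2.0)) for _ in range(m)) / m
    p4 = sum(h(rng.uniform(0.0, 0.5)) for _ in range(m)) / (4 * m)
    return p1, p2, p3, p4


exact = 0.5 - math.atan(2.0) / math.pi
rng = random.Random(0)  # arbitrary seed
estimates = cauchy_tail_estimates(10_000, rng)
print(f"exact p = {exact:.5f}")
for name, value in zip(("p1", "p2", "p3", "p4"), estimates):
    print(f"{name} = {value:.5f}  (error {value - exact:+.5f})")
```

With the same number of draws, the errors of $\hat p_3$ and $\hat p_4$ are typically much smaller than those of $\hat p_1$ and $\hat p_2$, in line with the variances quoted in the example.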

The evaluation of (3.4) based on simulation from $f$ is therefore not
necessarily optimal and Theorem 3.12 shows that this choice is, in fact, always
suboptimal. Note also that the integral (3.4) can be represented in an infinite
number of ways by triplets $(\mathcal{X}, h, f)$. Therefore, the search for an optimal
estimator should encompass all these possible representations (as in Example
3.8). As a side remark, we should stress that the very notion of "optimality" of
a representation is quite difficult to define precisely. Indeed, as already noted
in Chapter 2, the comparison of simulation methods cannot be equated with
the comparison of the variances of the resulting estimators. Conception and
computation times should also be taken into account. At another level, note
that the optimal method proposed in Theorem 3.12 depends on the function $h$
involved in (3.4). Therefore, it cannot be considered as optimal when several
integrals related to $f$ are simultaneously evaluated. In such cases, which often
occur in Bayesian analysis, only generic methods can be compared (that is to
say, those which are independent of $h$).

The principal alternative to direct sampling from $f$ for the evaluation of
(3.4) is to use importance sampling, defined as follows:

Definition 3.9. The method of importance sampling is an evaluation of (3.4)
based on generating a sample $X_1, \ldots, X_m$ from a given distribution $g$ and
approximating

$$\mathbb{E}_f[h(X)] \approx \frac{1}{m} \sum_{j=1}^{m} \frac{f(X_j)}{g(X_j)}\, h(X_j). \tag{3.8}$$

This method is based on the alternative representation of (3.4):

$$\mathbb{E}_f[h(X)] = \int_{\mathcal{X}} h(x)\, \frac{f(x)}{g(x)}\, g(x)\, dx, \tag{3.9}$$

which is called the importance sampling fundamental identity, and the
estimator (3.8) converges to (3.4) for the same reason the regular Monte Carlo
estimator $\bar h_m$ converges, whatever the choice of the distribution $g$ (as long as
$\text{supp}(g) \supset \text{supp}(f)$).

Note that (3.9) is a very general representation that expresses the fact
that a given integral is not intrinsically associated with a given distribution.
Example 3.8 shows how much of an effect this choice of representation can
have. Importance sampling is therefore of considerable interest since it puts
very little restriction on the choice of the instrumental distribution $g$, which
can be chosen from distributions that are easy to simulate. Moreover, the
same sample (generated from $g$) can be used repeatedly, not only for different
functions $h$ but also for different densities $f$, a feature which is quite attractive
for robustness and Bayesian sensitivity analyses.
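A minimal sketch of the estimator (3.8) is given below. The target, instrumental distribution, and test function are illustrative assumptions, not taken from the text: $f$ is the $\mathcal{N}(0,1)$ density, $g$ is the Cauchy $\mathcal{C}(0,1)$ density (heavier tails than $f$, so $f/g$ is bounded), and $h(x) = x^2$, for which $\mathbb{E}_f[h(X)] = 1$.

```python
import math
import random


def importance_sampling(h, f_pdf, g_pdf, g_draw, m, rng):
    """Estimator (3.8): average of h(X_j) f(X_j) / g(X_j) with X_j drawn from g."""
    total = 0.0
    for _ in range(m):
        x = g_draw(rng)
        total += h(x) * f_pdf(x) / g_pdf(x)
    return total / m


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cauchy_pdf(x):
    return 1.0 / (math.pi * (1.0 + x * x))


def cauchy_draw(rng):
    # Inversion of the Cauchy cdf
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(0)  # arbitrary seed
estimate = importance_sampling(lambda x: x * x, normal_pdf, cauchy_pdf, cauchy_draw, 50_000, rng)
print(f"importance sampling estimate of E_f[X^2]: {estimate:.4f} (exact value 1)")
```

The same list of draws from $g$ could be reused with another `h` or another `f_pdf`, which is the reuse property stressed in the paragraph above.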

Example 3.10. Exponential and log-normal comparison. Consider $X$
as an estimator of $\lambda$, when $X \sim \mathcal{E}xp(1/\lambda)$ or when $X \sim \mathcal{LN}(0, \sigma^2)$ (with
$e^{\sigma^2/2} = \lambda$, see Problem 3.11). If the goal is to compare the performances of
this estimator under both distributions for the scaled squared error loss

$$L(\lambda, \delta) = (\delta - \lambda)^2/\lambda^2,$$

a single sample from $\mathcal{LN}(0, \sigma^2)$, $X_1, \ldots, X_T$, can be used for both purposes,
the risks being evaluated by

$$\hat R_1 = \frac{1}{T \lambda^2} \sum_{t=1}^{T} X_t\, e^{-X_t/\lambda}\, \lambda^{-1}\, e^{\log(X_t)^2/2\sigma^2}\, \sqrt{2\pi}\, \sigma\, (X_t - \lambda)^2$$

in the exponential case and by

$$\hat R_2 = \frac{1}{T \lambda^2} \sum_{t=1}^{T} (X_t - \lambda)^2$$

in the log-normal case. In addition, the scale nature of the parameterization
allows a single sample $(Y_1^0, \ldots, Y_T^0)$ from $\mathcal{N}(0, 1)$ to be used for all $\sigma$'s, with
$X_t = \exp(\sigma Y_t^0)$.

Fig. 3.4. Graph of approximate scaled squared error risks of $X$ vs. $\lambda$ for an
exponential and a log-normal observation (panels "Exponential" and "Lognormal"),
compared with the theoretical values (dashes) for $\lambda \in [1, 6]$ (10,000 simulations).

The comparison of these evaluations is given in Figure 3.4 for $T = 10,000$,
each point corresponding to a sample of size $T$ simulated from $\mathcal{LN}(0, \sigma^2)$ by
the above transformation. The exact values are given by 1 and $(\lambda + 1)(\lambda - 1)$,
respectively. Note that implementing importance sampling in the opposite
way offers little appeal since the weights $\exp\{-\log(X_t)^2/2\sigma^2\} \times \exp(X_t/\lambda)/X_t$
have infinite variance (see below). The graph of the risk in the exponential
case is then more stable than for the original sample from the log-normal
distribution. ||

We close this section by revisiting a previous example with a new twist. 

Example 3.11. Small tail probabilities. In Example 3.5 we calculated
normal tail probabilities with Monte Carlo sums, and found the method to work
well. However, the method breaks down if we need to go too far into the
tail. For example, if $Z \sim \mathcal{N}(0, 1)$, and we are interested in the probability
$P(Z > 4.5)$ (which we know is very small), we could simulate $Z^{(i)} \sim \mathcal{N}(0, 1)$
for $i = 1, \ldots, M$ and calculate

$$P(Z > 4.5) \approx \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}(Z^{(i)} > 4.5).$$

If we do this, a value of $M = 10,000$ usually produces all zeros of the indicator
function.

Of course, the problem is that we are calculating the probability of a very
rare event, and naive simulation will need a lot of iterations to get a
reasonable answer. However, with importance sampling we can greatly improve our
accuracy.

Let $Y \sim \mathcal{TE}(4.5, 1)$, an exponential distribution (left) truncated at 4.5
with scale 1, with density

$$f_Y(y) = e^{-y} \Big/ \int_{4.5}^{\infty} e^{-x}\, dx = e^{-(y - 4.5)}, \qquad y \ge 4.5.$$

If we now simulate from $f_Y$ and use importance sampling, we obtain (see
Problem 3.16)

$$P(Z > 4.5) \approx \frac{1}{M} \sum_{i=1}^{M} \frac{\varphi(Y^{(i)})}{f_Y(Y^{(i)})}\, \mathbb{I}(Y^{(i)} > 4.5) = .000003377.$$ ||


3.3.2 Finite Variance Estimators 



Although the distribution $g$ can be almost any density for the estimator (3.8)
to converge, there are obviously some choices that are better than others, and
it is natural to try to compare different distributions $g$ for the evaluation of
(3.4). First, note that, while (3.8) does converge almost surely to (3.4), its
variance is finite only when the expectation

$$\mathbb{E}_g\left[h^2(X)\, \frac{f^2(X)}{g^2(X)}\right] = \mathbb{E}_f\left[h^2(X)\, \frac{f(X)}{g(X)}\right] = \int_{\mathcal{X}} h^2(x)\, \frac{f^2(x)}{g(x)}\, dx < \infty.$$

Thus, instrumental distributions with tails lighter than those of $f$ (that is,
those with unbounded ratios $f/g$) are not appropriate for importance
sampling. In fact, in these cases, the variances of the corresponding estimators
(3.8) will be infinite for many functions $h$. More generally, if the ratio $f/g$
is unbounded, the weights $f(x_j)/g(x_j)$ will vary widely, giving too much
importance to a few values $x_j$. This means that the estimator (3.8) may change
abruptly from one iteration to the next one, even after many iterations.
Conversely, distributions $g$ with thicker tails than $f$ ensure that the ratio $f/g$ does
not cause the divergence of $\mathbb{E}_f[h^2 f/g]$. In particular, Geweke (1989) mentions
two types of sufficient conditions:

(a) $f(x)/g(x) < M$ $\forall x \in \mathcal{X}$ and $\text{var}_f(h) < \infty$;

(b) $\mathcal{X}$ is compact, $f(x) < F$ and $g(x) > \varepsilon$ $\forall x \in \mathcal{X}$.

These conditions are quite restrictive. In particular, $f/g < M$ implies that
the Accept-Reject algorithm [A.4] also applies. (A comparison between the
two approaches is given in Section 3.3.3.)

An alternative to (3.8) which addresses the finite variance issue, and
generally yields a more stable estimator, is to use

$$\frac{\sum_{j=1}^{m} h(x_j)\, f(x_j)/g(x_j)}{\sum_{j=1}^{m} f(x_j)/g(x_j)}, \tag{3.10}$$

where we have replaced $m$ with the sum of the weights. Since $(1/m) \sum_{j=1}^{m}
f(x_j)/g(x_j)$ converges to 1 as $m \to \infty$, this estimator also converges to
$\mathbb{E}_f h(X)$ by the Strong Law of Large Numbers. Although this estimator is
biased, the bias is small, and the improvement in variance makes it a
preferred alternative to (3.8) (see also Lemma 4.3). In fact, Casella and Robert
(1998) have shown that the weighted estimator (3.10) may perform better
(when evaluated under squared error loss) in some settings. (See also Van Dijk
and Kloeck 1984.) For instance, when $h$ is nearly constant, (3.10) is close to
this value, while (3.8) has a higher variation since the sum of the weights is
different from one.
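The weighted estimator (3.10) can be sketched as below. One practical consequence, illustrated here as an assumption-laden example rather than a claim from the text, is that $f$ only needs to be known up to a multiplicative constant, since the constant cancels between numerator and denominator. The target is taken to be the unnormalized $\mathcal{N}(0,1)$ kernel $e^{-x^2/2}$ with a Cauchy instrumental distribution; all names are hypothetical.

```python
import math
import random


def self_normalized_is(h, f_unnorm, g_pdf, g_draw, m, rng):
    """Estimator (3.10): sum of w_j h(x_j) divided by sum of w_j, with w_j = f(x_j)/g(x_j)."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(m):
        x = g_draw(rng)
        w = f_unnorm(x) / g_pdf(x)
        numerator += w * h(x)
        denominator += w
    return numerator / denominator


def normal_kernel(x):
    # N(0,1) density without its normalizing constant 1/sqrt(2 pi)
    return math.exp(-0.5 * x * x)


def cauchy_pdf(x):
    return 1.0 / (math.pi * (1.0 + x * x))


def cauchy_draw(rng):
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(0)  # arbitrary seed
second_moment = self_normalized_is(lambda x: x * x, normal_kernel, cauchy_pdf, cauchy_draw, 50_000, rng)
constant = self_normalized_is(lambda x: 3.0, normal_kernel, cauchy_pdf, cauchy_draw, 1_000, rng)
print(f"E_f[X^2] estimate: {second_moment:.4f} (exact value 1)")
print(f"constant h = 3   : {constant:.6f}")
```

The second call shows the remark made above: for a constant $h$, (3.10) returns that constant (up to rounding) whatever the weights are.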

Among the distributions $g$ leading to finite variances for the estimator
(3.8), it is, in fact, possible to exhibit the optimal distribution corresponding
to a given function $h$ and a fixed distribution $f$, as stated by the following
result of Rubinstein (1981); see also Geweke (1989).

Theorem 3.12. The choice of $g$ that minimizes the variance of the estimator
(3.8) is

$$g^*(x) = \frac{|h(x)|\, f(x)}{\int_{\mathcal{X}} |h(z)|\, f(z)\, dz}.$$

Proof. First note that

$$\text{var}\left[\frac{h(X) f(X)}{g(X)}\right] = \mathbb{E}_g\left[\frac{h^2(X) f^2(X)}{g^2(X)}\right] - \left(\mathbb{E}_g\left[\frac{h(X) f(X)}{g(X)}\right]\right)^2,$$

and the second term does not depend on $g$. So, to minimize variance, we only
need minimize the first term. From Jensen's inequality it follows that

$$\mathbb{E}_g\left[\frac{h^2(X) f^2(X)}{g^2(X)}\right] \ge \left(\mathbb{E}_g\left[\frac{|h(X)|\, f(X)}{g(X)}\right]\right)^2 = \left(\int_{\mathcal{X}} |h(x)|\, f(x)\, dx\right)^2,$$

which provides a lower bound that is independent of the choice of $g$. It is
straightforward to verify that this lower bound is attained by choosing $g = g^*$.

□
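The final step of the proof, left to the reader, can be written out as a short check. The derivation below only uses the definition of $g^*$ in Theorem 3.12; the symbol $\operatorname{sign}$ is introduced here for convenience.

```latex
\begin{align*}
\frac{h(x)\, f(x)}{g^*(x)}
  &= \operatorname{sign}(h(x)) \int_{\mathcal{X}} |h(z)|\, f(z)\, dz
  \qquad \text{whenever } h(x) f(x) \neq 0, \\
\mathbb{E}_{g^*}\!\left[\frac{h^2(X)\, f^2(X)}{g^{*2}(X)}\right]
  &= \left(\int_{\mathcal{X}} |h(z)|\, f(z)\, dz\right)^{2}
     \int_{\mathcal{X}} g^*(x)\, dx
   = \left(\int_{\mathcal{X}} |h(z)|\, f(z)\, dz\right)^{2}.
\end{align*}
```

The squared ratio is therefore constant on the support of $g^*$, so Jensen's inequality holds with equality. When $h \ge 0$, the ratio $h f/g^*$ itself is constant and the variance of (3.8) is zero.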



This optimality result is rather formal since, when $h(x) > 0$, the optimal
choice $g^*(x)$ requires the knowledge of $\int h(x) f(x)\, dx$, the integral of interest! A
practical alternative taking advantage of Theorem 3.12 is to use the estimator
(3.10) as

$$\frac{\sum_{j=1}^{m} h(x_j)\, f(x_j)/g(x_j)}{\sum_{j=1}^{m} f(x_j)/g(x_j)} = \frac{\sum_{j=1}^{m} h(x_j)\, |h(x_j)|^{-1}}{\sum_{j=1}^{m} |h(x_j)|^{-1}}, \tag{3.11}$$

where $x_j \sim g \propto |h| f$. Note that the numerator is the number of times $h(x_j)$
is positive minus the number of times it is negative. In particular, when $h$
is positive, (3.11) is the harmonic mean. Unfortunately, the optimality of
Theorem 3.12 does not transfer to (3.11), which is biased and may exhibit
severe instability.^

From a practical point of view, Theorem 3.12 suggests looking for
distributions $g$ for which $|h| f/g$ is almost constant with finite variance. It is important
to note that although the finite variance constraint is not necessary for the
convergence of (3.8) and of (3.11), importance sampling performs quite poorly
when

$$\int \frac{f^2(x)}{g(x)}\, dx = +\infty, \tag{3.12}$$

whether in terms of behavior of the estimator (high-amplitude jumps,
instability of the path of the average, slow convergence) or of comparison with direct
Monte Carlo methods. Distributions $g$ such that (3.12) occurs are therefore
not recommended.

The next two examples show that importance sampling methods can bring
considerable improvement over naive Monte Carlo estimates when
implemented with care. However, they can encounter disastrous performances and
produce extremely poor estimates when the variance conditions are not met.



Example 3.13. Student's t distribution. Consider $X \sim \mathcal{T}(\nu, \theta, \sigma^2)$, with
density

$$f(x) = \frac{\Gamma((\nu + 1)/2)}{\sigma \sqrt{\nu\pi}\; \Gamma(\nu/2)} \left(1 + \frac{(x - \theta)^2}{\nu \sigma^2}\right)^{-(\nu+1)/2}.$$

Without loss of generality, we take $\theta = 0$ and $\sigma = 1$. We choose the quantities
of interest to be $\mathbb{E}_f[h_i(X)]$ ($i = 1, 2, 3$), with

$$h_1(x) = \sqrt{\left|\frac{x}{1 - x}\right|}, \qquad
h_2(x) = x^5\, \mathbb{I}_{[2.1, \infty[}(x), \qquad
h_3(x) = \frac{x^5}{1 + (x - 3)^2}\, \mathbb{I}_{x \ge 0}.$$

Obviously, it is possible to generate directly from $f$. Importance sampling
alternatives are associated here with a Cauchy $\mathcal{C}(0, 1)$ distribution and a
normal $\mathcal{N}(0, \nu/(\nu - 2))$ distribution (scaled so that the variance is the same as
$\mathcal{T}(\nu, \theta, \sigma^2)$). The choice of the normal distribution is not expected to be
efficient, as the ratio

$$\frac{f^2(x)}{g(x)} \propto \frac{e^{x^2(\nu - 2)/2\nu}}{[1 + x^2/\nu]^{\nu + 1}}$$

does not have a finite integral. However, this will give us an opportunity to
study the performance of importance sampling in such a situation. On the
other hand, the $\mathcal{C}(0, 1)$ distribution has larger tails than $f$ and ensures that
the variance of $f/g$ is finite.

^ In fact, the optimality only applies to the numerator, while another sequence
should be used to better approximate the denominator.

Fig. 3.5. Empirical range of three series of estimators of $\mathbb{E}_f[|X/(1 - X)|^{1/2}]$ for
$\nu = 12$ and 500 replications: sampling from $f$ (left), importance sampling with a
Cauchy instrumental distribution (center) and importance sampling with normal
importance distribution (right). Average of the 500 series in overlay.

Figure 3.5 illustrates the performances of the three corresponding
estimators for the function $h_1$ when $\nu = 12$ by representing the range of 500 series
over 2000 iterations. The average of these series is quite stable over iterations
and does not depend on the choice of the importance function, while the range
exhibits wide jumps for all three. This phenomenon is due to the fact that the
function $h_1$ has a singularity at $x = 1$ such that $h_1^2$ is not integrable under $f$
but also such that neither of the two other importance sampling estimators has
a finite variance (Problem 3.20)! Were we to repeat this experiment with 5000
series rather than 500 series, we would then see larger ranges. There is thus
no possible comparison between the three proposals in this case, since they
all are inefficient. An alternative choice devised purposely for this function $h_1$
is to choose $g$ such that $(1 - x) g(x)$ is better behaved at $x = 1$. If we take for
instance the double Gamma distribution folded at 1, that is, the distribution
of $X$ symmetric around 1 such that

$$|X - 1| \sim \mathcal{G}a(\alpha, 1),$$

the ratio

$$\frac{h_1^2(x)\, f^2(x)}{g(x)} \propto |x|\, f^2(x)\, |1 - x|^{-\alpha}\, \exp|1 - x|$$

is integrable around $x = 1$ when $\alpha < 1$. Obviously, the exponential part
creates problems at $\infty$ and leads once more to an infinite variance, but it has
much less influence on the stability of the estimator, as shown in Figure 3.6.

Since both $h_2$ and $h_3$ have restricted supports, we could benefit by having
the instrumental distributions take this information into account. In the case
of $h_2$, a uniform distribution on $[0, 1/2.1]$ is reasonable, since the expectation
$\mathbb{E}_f[h_2(X)]$ can be written as

$$\int_0^{1/2.1} u^{-7} f(1/u)\, du = \frac{1}{2.1} \int_0^{1/2.1} 2.1\, u^{-7} f(1/u)\, du,$$

as in Example 3.8. The corresponding importance sampling estimator is then

$$\delta_2 = \frac{1}{2.1\, m} \sum_{j=1}^{m} U_j^{-7} f(1/U_j),$$

where the $U_j$'s are iid $\mathcal{U}([0, 1/2.1])$. Figure 3.7 shows the improvement brought
by this choice, with the estimator $\delta_2$ converging to the true value after only a
few hundred iterations. The importance sampling estimator associated with
the Cauchy distribution is also quite stable, but it requires more iterations
to achieve the same precision. Both of the other estimators (which are based
on the true distribution and the normal distribution, respectively) fluctuate
around the exact value with high-amplitude jumps, because their variance is
infinite.

Fig. 3.6. Empirical range of the importance sampling estimator of
$\mathbb{E}_f[|X/(1 - X)|^{1/2}]$ for $\nu = 12$ and 500 replications based on the double Gamma
$\mathcal{G}a(\alpha, 1)$ distribution folded at 1 when $\alpha = .5$. Average of the 500 series in overlay.

In the case of $h_3$, a reasonable candidate for the instrumental distribution
is $g(x) = \exp(-x)\, \mathbb{I}_{x \ge 0}$, leading to the estimation of

$$\mathbb{E}_f[h_3(X)] = \int_0^{\infty} \frac{x^5}{1 + (x - 3)^2}\, f(x)\, dx
= \int_0^{\infty} \frac{x^5}{1 + (x - 3)^2}\, f(x)\, e^{x}\, e^{-x}\, dx$$

by

$$\frac{1}{m} \sum_{j=1}^{m} h_3(X_j)\, w(X_j), \tag{3.13}$$

where the $X_j$'s are iid $\mathcal{E}xp(1)$ and $w(x) = f(x) \exp(x)$. Figure 3.8 shows
that, although this weight does not have a finite expectation under $\mathcal{T}(\nu, 0, 1)$,
meaning that the variance is infinite, the estimator (3.13) provides a good
approximation of $\mathbb{E}_f[h_3(X)]$, having the same order of precision as the
estimation provided by the exact simulation, and greater stability. The estimator
based on the Cauchy distribution is, as in the other case, stable, but its bias
is, again, slow to vanish, and the estimator associated with the normal
distribution once more displays large fluctuations which considerably hinder its
convergence.

Fig. 3.7. Convergence of four estimators of $\mathbb{E}_f[X^5\, \mathbb{I}_{X \ge 2.1}]$ for $\nu = 12$: Sampling
from $f$ (solid lines), importance sampling with Cauchy instrumental distribution
(short dashes), importance sampling with uniform $\mathcal{U}([0, 1/2.1])$ instrumental
distribution (long dashes) and importance sampling with normal instrumental
distribution (dots). The final values are respectively 6.75, 6.48, 6.57, and 7.06, for an exact
value of 6.54. ||
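The uniform-based estimator $\delta_2$ of $\mathbb{E}_f[h_2(X)]$ can be sketched as follows. The sketch assumes $h_2(x) = x^5\, \mathbb{I}_{x \ge 2.1}$ and the $\mathcal{T}(12, 0, 1)$ density as reconstructed above, and draws the uniform variable on $(0, 1/2.1]$ to avoid a division by zero; the function names are hypothetical.

```python
import math
import random


def student_t_pdf(x, nu):
    """Density of the T(nu, 0, 1) distribution."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1.0 + x * x / nu) ** (-(nu + 1) / 2)


def delta2(m, nu, rng, cutoff=2.1):
    """Importance sampling estimate of E_f[X^5 I(X >= cutoff)] with U ~ U(0, 1/cutoff]."""
    upper = 1.0 / cutoff
    total = 0.0
    for _ in range(m):
        u = upper * (1.0 - rng.random())   # uniform on (0, upper], never exactly 0
        total += u ** (-7) * student_t_pdf(1.0 / u, nu)
    # Dividing by cutoff accounts for the uniform density, which equals cutoff on (0, 1/cutoff]
    return total / (cutoff * m)


rng = random.Random(0)  # arbitrary seed
for m in (100, 1_000, 10_000):
    print(f"m={m:6d}  delta_2 = {delta2(m, 12, rng):.3f}   (exact value quoted in the text: 6.54)")
```

Because the integrand $u^{-7} f(1/u)$ is bounded on $(0, 1/2.1]$ for $\nu = 12$, the estimate settles quickly, which is consistent with the behaviour described for Figure 3.7.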



Example 3.14. Transition matrix estimation. Consider a Markov chain
with two states, 1 and 2, whose transition matrix is

$$T = \begin{pmatrix} p_1 & 1 - p_1 \\ 1 - p_2 & p_2 \end{pmatrix},$$

that is,

$$P(X_{t+1} = 1 | X_t = 1) = 1 - P(X_{t+1} = 2 | X_t = 1) = p_1,$$
$$P(X_{t+1} = 2 | X_t = 2) = 1 - P(X_{t+1} = 1 | X_t = 2) = p_2.$$

Fig. 3.8. Convergence of four estimators of $\mathbb{E}_f[h_3(X)]$: Sampling from $f$ (solid
lines), importance sampling with Cauchy instrumental distribution (short dashes),
with normal instrumental distribution (dots), and with exponential instrumental
distribution (long dashes). The final values after 50,000 iterations are respectively
4.58, 4.42, 4.99, and 4.52, for a true value of 4.64.



Assume, in addition, that the constraint $p_1 + p_2 < 1$ holds (see Geweke
1989 for a motivation related to continuous time processes). If the sample
is $X_1, \ldots, X_m$ and the prior distribution is

$$\pi(p_1, p_2) = 2\, \mathbb{I}_{p_1 + p_2 < 1},$$

the posterior distribution of $(p_1, p_2)$ is

$$\pi(p_1, p_2 | m_{11}, m_{12}, m_{21}, m_{22}) \propto p_1^{m_{11}} (1 - p_1)^{m_{12}} (1 - p_2)^{m_{21}} p_2^{m_{22}}\, \mathbb{I}_{p_1 + p_2 < 1},$$

where $m_{ij}$ is the number of passages from $i$ to $j$, that is,

$$m_{ij} = \sum_{t=1}^{m-1} \mathbb{I}_{x_t = i}\, \mathbb{I}_{x_{t+1} = j},$$

and it follows that $\mathcal{D} = (m_{11}, \ldots, m_{22})$ is a sufficient statistic.

Suppose now that the quantities of interest are the posterior expectations
of the probabilities and the associated odds:

$$h_1(p_1, p_2) = p_1, \quad h_2(p_1, p_2) = p_2, \quad h_3(p_1, p_2) = \frac{p_1}{1 - p_1}$$

and

$$h_4(p_1, p_2) = \frac{p_2}{1 - p_2}, \quad h_5(p_1, p_2) = \log\left(\frac{p_1 (1 - p_2)}{p_2 (1 - p_1)}\right),$$

respectively.

We now look at a number of ways in which to calculate these posterior
expectations.

(i) The distribution $\pi(p_1, p_2 | \mathcal{D})$ is the restriction of the product of two
distributions $\mathcal{B}e(m_{11} + 1, m_{12} + 1)$ and $\mathcal{B}e(m_{22} + 1, m_{21} + 1)$ to the simplex
$\{(p_1, p_2) : p_1 + p_2 < 1\}$. So a reasonable first approach is to simulate these
two distributions until the sum of two realizations is less than 1.
Unfortunately, this naive strategy is rather inefficient since, for the given data
$(m_{11}, m_{12}, m_{21}, m_{22}) = (68, 28, 17, 4)$ we have $P^{\pi}(p_1 + p_2 < 1 | \mathcal{D}) = 0.21$
(Geweke 1989). The importance sampling alternatives are to simulate
distributions which are restricted to the simplex.

(ii) A solution inspired from the shape of $\pi(p_1, p_2 | \mathcal{D})$ is a Dirichlet distribution
$\mathcal{D}(m_{11} + 1, m_{22} + 1, m_{12} + m_{21} + 1)$, with density

$$\pi_1(p_1, p_2 | \mathcal{D}) \propto p_1^{m_{11}}\, p_2^{m_{22}}\, (1 - p_1 - p_2)^{m_{12} + m_{21}}.$$

However, the ratio $\pi(p_1, p_2 | \mathcal{D}) / \pi_1(p_1, p_2 | \mathcal{D})$ is not bounded and the
corresponding variance is infinite.

(iii) Geweke's (1989) proposal is to use the normal approximation to the
binomial distribution, that is,

$$\pi_2(p_1, p_2 | \mathcal{D}) \propto \exp\{-(m_{11} + m_{12})(p_1 - \hat p_1)^2 / 2 \hat p_1 (1 - \hat p_1)\}
\times \exp\{-(m_{21} + m_{22})(p_2 - \hat p_2)^2 / 2 \hat p_2 (1 - \hat p_2)\}\, \mathbb{I}_{p_1 + p_2 < 1},$$

where $\hat p_i$ is the maximum likelihood estimator of $p_i$, that is, $m_{ii}/(m_{ii} +
m_{i(3-i)})$. An efficient way to simulate $\pi_2$ is then to simulate $p_1$ from the
normal distribution $\mathcal{N}(\hat p_1, \hat p_1 (1 - \hat p_1)/(m_{12} + m_{11}))$ restricted to $[0, 1]$,
then $p_2$ from the normal distribution $\mathcal{N}(\hat p_2, \hat p_2 (1 - \hat p_2)/(m_{21} + m_{22}))$
restricted to $[0, 1 - p_1]$, using the method proposed by Geweke (1991) and
Robert (1995b). The ratio $\pi/\pi_2$ then has a finite expectation under $\pi$,
since $(p_1, p_2)$ is restricted to $\{(p_1, p_2) : p_1 + p_2 < 1\}$.

(iv) Another possibility is to keep the distribution $\mathcal{B}e(m_{11} + 1, m_{12} + 1)$ as the
marginal distribution on $p_1$ and to modify the conditional distribution
$p_2^{m_{22}} (1 - p_2)^{m_{21}}\, \mathbb{I}_{p_2 < 1 - p_1}$ into

$$\pi_3(p_2 | p_1, \mathcal{D}) = \frac{2}{(1 - p_1)^2}\, p_2\, \mathbb{I}_{p_2 < 1 - p_1}.$$

The ratio $w(p_1, p_2) \propto p_2^{m_{22} - 1} (1 - p_2)^{m_{21}} (1 - p_1)^2$ is then bounded in
$(p_1, p_2)$.

Table 3.4 provides the estimators of the posterior expectations of the
functions $h_j$ evaluated for the true distribution $\pi$ (simulated the naive way, that
is, until $p_1 + p_2 < 1$) and for the three instrumental distributions $\pi_1$, $\pi_2$ and
$\pi_3$. The distribution $\pi_3$ is clearly preferable to the two other instrumental
distributions since it provides the same estimation as the true distribution, at
a lower computational cost. Note that $\pi_1$ does worse in all cases.

Figure 3.9 describes the evolution of the estimators (3.10) of $\mathbb{E}[h_5]$ as
$m$ increases for the three instrumental distributions considered. Similarly to
Table 3.4, it shows the improvement brought by the distribution $\pi_3$ upon the
alternative distributions, since the precision is of the same order as the true
distribution, for a significantly lower simulation cost. The jumps in the graphs
of the estimators associated with $\pi_2$ and, especially, with $\pi_1$ are characteristic
of importance sampling estimators with infinite variance. ||

    Distribution    h_1      h_2      h_3      h_4      h_5
    pi_1           0.748    0.139    3.184    0.163    2.957
    pi_2           0.689    0.210    2.319    0.283    2.211
    pi_3           0.697    0.189    2.379    0.241    2.358
    pi             0.697    0.189    2.373    0.240    2.358

Table 3.4. Comparison of the evaluations of $\mathbb{E}_f[h_j]$ for the estimators (3.10)
corresponding to three instrumental distributions $\pi_i$ and to the true distribution $\pi$
(10,000 simulations).
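Proposal (iv) combined with the weighted estimator (3.10) can be sketched as follows. The sketch assumes the data $(m_{11}, m_{12}, m_{21}, m_{22}) = (68, 28, 17, 4)$ of the example, draws $p_2$ from $\pi_3$ by inversion ($p_2 = (1 - p_1)\sqrt{U}$, since the conditional cdf is $(p_2/(1 - p_1))^2$), and uses the unnormalized weight $w(p_1, p_2)$ given in (iv); the function name is hypothetical.

```python
import math
import random


def posterior_expectations_pi3(m11, m12, m21, m22, m, rng):
    """Self-normalized importance sampling estimates of E[h_1], ..., E[h_5] using pi_3."""
    sums = [0.0] * 5
    weight_sum = 0.0
    for _ in range(m):
        p1 = rng.betavariate(m11 + 1, m12 + 1)            # marginal Be(m11 + 1, m12 + 1)
        p2 = (1.0 - p1) * math.sqrt(1.0 - rng.random())   # density 2 p2 / (1 - p1)^2 on (0, 1 - p1]
        w = p2 ** (m22 - 1) * (1.0 - p2) ** m21 * (1.0 - p1) ** 2
        values = (
            p1,
            p2,
            p1 / (1.0 - p1),
            p2 / (1.0 - p2),
            math.log(p1 * (1.0 - p2) / (p2 * (1.0 - p1))),
        )
        for k, value in enumerate(values):
            sums[k] += w * value
        weight_sum += w
    return [s / weight_sum for s in sums]


rng = random.Random(0)  # arbitrary seed
estimates = posterior_expectations_pi3(68, 28, 17, 4, 10_000, rng)
for k, value in enumerate(estimates, start=1):
    print(f"E[h_{k}] is approximately {value:.3f}")
```

Every simulated pair lies in the simplex, so no draw is rejected, in contrast with the naive strategy (i); the output can be compared with the $\pi_3$ and $\pi$ rows of Table 3.4.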




Fig. 3.9. Convergence of four estimators of $\mathbb{E}_f[h_5(X)]$ for the true distribution $\pi$
(solid lines) and for the instrumental distributions $\pi_1$ (dots), $\pi_2$ (long dashes), and
$\pi_3$ (short dashes). The final values after 10,000 iterations are 2.373, 3.184, 2.319,
and 2.379, respectively.



We therefore see that importance sampling cannot be applied blindly.
Rather, care must be taken in choosing an instrumental density as the
almost sure convergence of (3.8) is only formal (in the sense that it may require
an enormous number of simulations to produce an accurate approximation of
the quantity of interest). These words of caution are meant to make the user
aware of the problems that might be encountered if importance sampling is
used when $\mathbb{E}_f[|f(X)/g(X)|]$ is infinite. (When $\mathbb{E}_f[f(X)/g(X)]$ is finite, the
stakes are not so high, as convergence is more easily attained.) If the issue of
finiteness of the variance is ignored, and not detected, it may result in strong
biases. For example, it can happen that the obvious divergence behavior of the
previous examples does not occur. Thus, other measures, such as monitoring
of the range of the weights $f(X_i)/g(X_i)$ (which are of mean 1 in all cases),
can help to detect convergence problems. (See also Note 4.6.1.)

The finiteness of the ratio $\mathbb{E}_f[f(X)/g(X)]$ can be achieved by substituting
a mixture distribution for the density $g$,

(3.14)    $\rho\, g(x) + (1 - \rho)\, \ell(x)$,

where $\rho$ is close to 1 and $\ell$ is chosen for its heavy tails (for instance, a Cauchy
or a Pareto distribution). From an operational point of view, this means that
the observations are generated with probability $\rho$ from $g$ and with probability
$1 - \rho$ from $\ell$. However, the mixture ($g$ versus $\ell$) does not play a role in the
computation of the importance weights; that is, by construction, the estimator
integrates out the uniform variable used to decide between $g$ and $\ell$. (We
discuss in detail such a marginalization perspective in Section 4.2, where uniform
variables involved in the simulation are integrated out in the estimator.)
Obviously, (3.14) replaces $g(x)$ in the weights of (3.8) or (3.11), which can
then ensure a finite variance for integrable functions $h^2$. Hesterberg (1998)
studies the performances of this approach, called a defensive mixture.
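
The defensive mixture (3.14) can be sketched in a few lines. The setting below is purely illustrative and is not taken from the text: the target $f$ is assumed to be $\mathcal{N}(0,1)$, the instrumental density $g$ is a deliberately too-narrow $\mathcal{N}(0, 0.5^2)$ (so that $f/g$ is unbounded), and $\ell$ is a standard Cauchy. The function name `defensive_mixture_is` and its parameters are hypothetical. The sketch also returns the mean and the maximum of the weights, in the spirit of the monitoring suggested above.

```python
import math
import random


def normal_pdf(x, sd=1.0):
    """Density of N(0, sd^2) at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def cauchy_pdf(x):
    """Density of the standard Cauchy distribution at x."""
    return 1.0 / (math.pi * (1.0 + x * x))


def defensive_mixture_is(h, n, rho=0.9, g_sd=0.5, seed=0):
    """Importance sampling estimate of E_f[h(X)] for f = N(0,1) (assumed target).

    Draws come from g = N(0, g_sd^2) with probability rho and from a
    standard Cauchy with probability 1 - rho; the weights use the full
    mixture density (3.14), not the component actually selected.
    Returns (estimate, mean weight, largest weight).
    """
    rng = random.Random(seed)
    sum_wh = sum_w = w_max = 0.0
    for _ in range(n):
        if rng.random() < rho:
            y = rng.gauss(0.0, g_sd)
        else:
            # Standard Cauchy draw by inversion of the cdf.
            y = math.tan(math.pi * (rng.random() - 0.5))
        mixture = rho * normal_pdf(y, g_sd) + (1.0 - rho) * cauchy_pdf(y)
        w = normal_pdf(y) / mixture
        sum_wh += w * h(y)
        sum_w += w
        w_max = max(w_max, w)
    return sum_wh / n, sum_w / n, w_max


if __name__ == "__main__":
    est, w_mean, w_max = defensive_mixture_is(lambda x: x * x, 200_000)
    print(est, w_mean, w_max)
```

With these assumed choices the weights cannot exceed $\sup_x f(x)/\{(1-\rho)\ell(x)\}$, which is about 15.2 for $\rho = 0.9$, whereas the weights $f/g$ of the narrow normal alone are unbounded.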



3.3.3 Comparing Importance Sampling with Accept-Reject 

Theorem 3.12 formally solves the problem of comparing Accept-Reject and
importance sampling methods, since, with the exception of the constant
functions $h(x) = h_0$, the optimal density $g^*$ is always different from $f$. However,
a more realistic comparison should also take account of the fact that
Theorem 3.12 is of limited applicability in a practical setup, as it prescribes an
instrumental density that depends on the function $h$ of interest. This may
not only result in a considerable increase of the computation time for every
new function $h$ (especially if the resulting instrumental density is not easy
to generate from), but it also eliminates the possibility of reusing the generated
sample to estimate a number of different quantities, as in Example 3.14.
Now, when the Accept-Reject method is implemented with a density $g$
satisfying $f(x) \le M g(x)$ for a constant $1 < M < \infty$, the density $g$ can serve as
the instrumental density for importance sampling. A positive feature is that
$f/g$ is bounded, thus ensuring finiteness of the variance for the corresponding
importance sampling estimators. Bear in mind, though, that in the
Accept-Reject method the resulting sample, $X_1, \ldots, X_n$, is a subsample of $Y_1, \ldots, Y_t$,
where the $Y_i$'s are simulated from $g$ and where $t$ is the (random) number of
simulations from $g$ required to produce the $n$ variables from $f$.

^ This section contains more specialized material and may be omitted on a first 
reading. 
To undertake a comparison of estimation using Accept-Reject and estimation
using importance sampling, it is reasonable to start with the two
traditional estimators

(3.15)    $\delta_1 = \frac{1}{n} \sum_{i=1}^{n} h(X_i)$    and    $\delta_2 = \frac{1}{t} \sum_{j=1}^{t} h(Y_j)\, \frac{f(Y_j)}{g(Y_j)}$.



These estimators correspond to the straightforward utilization of the sample
produced by Accept-Reject and to an importance sampling estimation derived
from the overall sample, that is, to a recycling of the variables rejected by
algorithm [A.4].^ If the ratio $f/g$ is only known up to a constant, $\delta_2$ can be
replaced by

$\delta_3 = \sum_{j=1}^{t} h(Y_j)\, \frac{f(Y_j)}{g(Y_j)} \Big/ \sum_{j=1}^{t} \frac{f(Y_j)}{g(Y_j)}$.



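If we write $\delta_2$ in the more explicit form

$\delta_2 = \frac{n}{t} \left\{ \frac{1}{n} \sum_{i=1}^{n} h(X_i)\, \frac{f(X_i)}{g(X_i)} \right\} + \frac{t-n}{t} \left\{ \frac{1}{t-n} \sum_{i=1}^{t-n} h(Z_i)\, \frac{f(Z_i)}{g(Z_i)} \right\}$,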

where $\{Y_1, \ldots, Y_t\} = \{X_1, \ldots, X_n\} \cup \{Z_1, \ldots, Z_{t-n}\}$ (the $Z_i$'s being the
variables rejected by the Accept-Reject algorithm [A.4]), one might argue that,
based on sample size, the variance of $\delta_2$ is smaller than that of the estimator

$\frac{1}{n} \sum_{i=1}^{n} h(X_i)\, \frac{f(X_i)}{g(X_i)}$.



If we could apply Theorem 3.12, we could then conclude that this latter
estimator dominates $\delta_1$ (for an appropriate choice of $g$) and, hence, that it
is better to recycle the $Z_i$'s than to discard them. Unfortunately, this
reasoning is flawed since $t$ is a random variable, being the stopping rule of the
Accept-Reject algorithm. The distribution of $t$ is therefore a negative binomial
distribution, $\mathcal{N}eg(n, 1/M)$ (see Problem 2.30), so $(Y_1, \ldots, Y_t)$ is not an
iid sample from $g$. (Note that the $Y_j$'s corresponding to the $X_i$'s, including
$Y_t$, have distribution $f$, whereas the others do not.)

The comparison between $\delta_1$ and $\delta_2$ can be reduced to comparing $\delta_1 = h(Y_t)$
and $\delta_2$ for $t \sim \mathcal{G}eo(1/M)$ and $n = 1$. However, even with this simplification,
the comparison is quite involved (see Problem 3.34 for details), so a general
comparison of the bias and variance of $\delta_2$ with $\mathrm{var}_f(h(X))$ is difficult (Casella
and Robert 1998).

While the estimator $\delta_2$ is based on an incorrect representation of the
distribution of $(Y_1, \ldots, Y_t)$, a reasonable alternative based on the correct
distribution of the sample is

^ This obviously assumes a relatively tight control on the simulation methods rather 
than the use of a (black box) pseudo-random generation software, which only 
delivers the accepted variables. 
(3.16)    $\delta_4 = \frac{n}{t}\, \delta_1 + \frac{1}{t} \sum_{j=1}^{t-n} \frac{(M-1) f(Z_j)}{M g(Z_j) - f(Z_j)}\, h(Z_j)$,

where the $Z_j$'s are the elements of $(Y_1, \ldots, Y_t)$ that have been rejected. This
estimator is also unbiased and the comparison with $\delta_1$ can also be studied in
the case $n = 1$; that is, through the comparison of the variances of $h(X_1)$ and
of $\delta_4$, which now can be written in the form

$\delta_4 = \frac{1}{t}\, h(X_1) + \frac{1-\rho}{t} \sum_{j=1}^{t-1} h(Z_j) \left( \frac{g(Z_j)}{f(Z_j)} - \rho \right)^{-1}$,    with $\rho = 1/M$.


Assuming again that $\mathbb{E}_f[h(X)] = 0$, the variance of $\delta_4$ is

$\mathrm{var}(\delta_4) = \mathbb{E}\left[ \frac{t-1}{t^2} \right] \int h^2(x)\, \frac{f^2(x)\,(M-1)}{M g(x) - f(x)}\, dx + \mathbb{E}\left[ \frac{1}{t^2} \right] \mathbb{E}_f[h^2(X)]$,

which is again too case-specific (that is, too dependent on $f$, $g$, and $h$) to
allow for a general comparison.

The marginal distribution of the $Z_i$'s from the Accept-Reject algorithm
is $(Mg - f)/(M - 1)$, and the importance sampling estimator $\delta_5$ associated
with this instrumental distribution is



$\delta_5 = \frac{1}{t-n} \sum_{j=1}^{t-n} \frac{(M-1) f(Z_j)}{M g(Z_j) - f(Z_j)}\, h(Z_j)$,

which allows us to write $\delta_4$ as

$\delta_4 = \frac{n}{t}\, \delta_1 + \frac{t-n}{t}\, \delta_5$,

a weighted average of the usual Monte Carlo estimator and of $\delta_5$.
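
The form of the weights in $\delta_4$ and $\delta_5$ can be checked directly. The short derivation below is a sketch added for convenience; it writes $Y$ for a generic proposal from $g$ and $Z$ for a generic rejected variable, and uses only the acceptance rule $U \le f(Y)/Mg(Y)$ of the Accept-Reject algorithm.

```latex
\begin{align*}
P(Y \in dz,\ Y \text{ rejected}) &= g(z)\left(1 - \frac{f(z)}{M g(z)}\right) dz ,
\qquad P(Y \text{ rejected}) = 1 - \frac{1}{M} , \\
\text{so that}\quad Z &\sim \frac{g(z) - f(z)/M}{1 - 1/M} = \frac{M g(z) - f(z)}{M - 1} , \\
\mathbb{E}\left[ \frac{(M-1) f(Z)}{M g(Z) - f(Z)}\, h(Z) \right]
 &= \int \frac{(M-1) f(z)}{M g(z) - f(z)}\, h(z)\, \frac{M g(z) - f(z)}{M - 1}\, dz
 = \int h(z) f(z)\, dz = \mathbb{E}_f[h(X)] .
\end{align*}
```

Each rejected variable therefore contributes an unbiased term, which is the property the weighted average above relies on.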

According to Theorem 3.12, the instrumental distribution can be chosen
such that the variance of $\delta_5$ is lower than the variance of $\delta_1$. Since this
estimator is unbiased, $\delta_4$ will dominate $\delta_1$ for an appropriate choice of $g$. This
domination result is of course as formal as Theorem 3.12, but it indicates that,
for a fixed $g$, there exist functions $h$ such that $\delta_4$ improves on $\delta_1$.

If $f$ is only known up to the constant of integration (hence, $f$ and $M$ are
not properly scaled), $\delta_4$ can be replaced by

(3.17)    $\delta_6 = \frac{n}{t}\, \delta_1 + \frac{t-n}{t} \sum_{j=1}^{t-n} \frac{h(Z_j) f(Z_j)}{M g(Z_j) - f(Z_j)} \Big/ \sum_{j=1}^{t-n} \frac{f(Z_j)}{M g(Z_j) - f(Z_j)}$.

Although the above domination of $\delta_1$ by $\delta_4$ does not extend to $\delta_6$, nonetheless,
$\delta_6$ correctly estimates constant functions while being asymptotically
equivalent to $\delta_4$. See Casella and Robert (1998) for additional domination results of
$\delta_1$ by weighted estimators.







Example 3.15. Gamma simulation. For illustrative purposes, consider
the simulation of $\mathcal{G}a(\alpha, \beta)$ from the instrumental distribution $\mathcal{G}a(a, b)$, with
$a = \lfloor \alpha \rfloor$ and $b = a\beta/\alpha$. (This choice of $b$ is justified in Example 2.19 as
maximizing the acceptance probability in an Accept-Reject scheme.) The ratio
$f/g$ is therefore

$w(x) = \frac{\Gamma(a)}{\Gamma(\alpha)}\, \frac{\beta^{\alpha}}{b^{a}}\, x^{\alpha - a}\, e^{(b - \beta)x}$,

which is bounded by

(3.18)    $M = \frac{\Gamma(a)}{\Gamma(\alpha)}\, \frac{\beta^{\alpha}}{b^{a}} \left( \frac{\alpha - a}{\beta - b} \right)^{\alpha - a} e^{-(\alpha - a)} = \frac{\Gamma(a)}{\Gamma(\alpha)}\, \exp\{\alpha(\log(\alpha) - 1) - a(\log(a) - 1)\}$.

Since the ratio $\Gamma(a)/\Gamma(\alpha)$ is bounded from above by 1, an approximate bound
that can be used in the simulation is

$M' = \exp\{\alpha(\log(\alpha) - 1) - a(\log(a) - 1)\}$,

with $M'/M = 1 + \varepsilon = \Gamma(\alpha)/\Gamma(\lfloor \alpha \rfloor)$. In this particular setup, the estimator
$\delta_4$ is available since $f/g$ and $M$ are explicitly known. In order to assess the
effect of the approximation (3.17), we also compute the estimator $\delta_6$ for the
following functions of interest:

$h_1(x) = x^3$,    $h_2(x) = x \log x$,    and    $h_3(x) = \frac{x}{1 + x}$.



Fig. 3.10. Convergence of the estimators of $\mathbb{E}[X/(1+X)]$, $\delta_1$ (solid lines), $\delta_4$ (dots)
and $\delta_6$ (dashes), for $\alpha = 3.7$ and $\beta = 1$. The final values are respectively 0.7518,
0.7495, and 0.7497, for a true value of the expectation equal to 0.7497.
Figure 3.10 describes the convergence of the three estimators of $h_3$ in $m$
for $\alpha = 3.7$ and $\beta = 1$ (which yields an Accept-Reject acceptance probability
of 1/M = .10). Both estimators $\delta_4$ and $\delta_6$ have more stable graphs than
the empirical average $\delta_1$ and they converge much faster to the theoretical
expectation 0.7497, $\delta_6$ then being equal to this value after 6,000 iterations. For
$\alpha = 3.08$ and $\beta = 1$ (which yields an Accept-Reject acceptance probability
of 1/M = .78), Figure 3.11 illustrates the change of behavior of the three
estimators of $h_3$ since they now converge at similar speeds. Note the proximity
of $\delta_4$ and $\delta_1$, $\delta_6$ again being the estimator closest to the theoretical expectation
0.7081 after 10,000 iterations.




Fig. 3.11. Convergence of estimators of $\mathbb{E}[X/(1+X)]$, $\delta_1$ (solid lines), $\delta_4$ (dots)
and $\delta_6$ (dashes) for $\alpha = 3.08$ and $\beta = 1$. The final values are respectively 0.7087,
0.7069, and 0.7084, for a true value of the expectation equal to 0.7081.



Table 3.5 provides another evaluation of the three estimators in a case
which is a priori very favorable to importance sampling, namely for $\alpha = 3.7$.
The table exhibits, in most cases, a strong domination of $\delta_4$ and $\delta_6$ over $\delta_1$
and a moderate domination of $\delta_4$ over $\delta_6$. ||

In contrast to the general setup of Section 3.3, $\delta_4$ (or its approximation $\delta_6$)
can always be used in an Accept-Reject sampling setup since this estimator
does not require additional simulations. It provides a second evaluation of
$\mathbb{E}_f[h]$, which can be compared with the Monte Carlo estimator for the purpose
of convergence assessment.
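
A minimal sketch of this recycling in the setting of Example 3.15 follows. It is an illustration under stated assumptions rather than the authors' implementation: the function name `gamma_ar_estimators` is hypothetical, and the Accept-Reject step is run with the looser bound $M'$ in place of $M$. Since $w \le M < M'$ when $\Gamma(a) < \Gamma(\alpha)$, $M'$ is still a valid constant, and the recycled weights $(M'-1)w/(M'-w)$ stay bounded.

```python
import math
import random


def gamma_ar_estimators(h, alpha=3.7, beta=1.0, n=20_000, seed=1):
    """Accept-Reject for Ga(alpha, beta) from Ga(a, b), recycling rejections.

    Returns (delta1, delta4, delta6, t), where t is the total number of
    proposals needed to accept n values. Uses the bound M' of Example 3.15
    as the Accept-Reject constant (an assumption of this sketch).
    """
    rng = random.Random(seed)
    a = math.floor(alpha)
    b = a * beta / alpha
    # log of the constant factor of w = f/g
    log_c = (math.lgamma(a) - math.lgamma(alpha)
             + alpha * math.log(beta) - a * math.log(b))

    def w(x):
        return math.exp(log_c + (alpha - a) * math.log(x) - (beta - b) * x)

    m_bound = math.exp(alpha * (math.log(alpha) - 1.0) - a * (math.log(a) - 1.0))

    accepted = 0
    t = 0
    sum_h_acc = 0.0   # sum of h over accepted values
    sum_rh = 0.0      # sum of h(Z) f(Z) / (M g(Z) - f(Z)), up to a common factor
    sum_r = 0.0       # sum of f(Z) / (M g(Z) - f(Z)), up to the same factor
    while accepted < n:
        # random.gammavariate takes (shape, scale); the rate b gives scale 1/b.
        y = rng.gammavariate(a, 1.0 / b)
        t += 1
        wy = w(y)
        if rng.random() <= wy / m_bound:
            accepted += 1
            sum_h_acc += h(y)
        else:
            r = wy / (m_bound - wy)
            sum_rh += r * h(y)
            sum_r += r

    delta1 = sum_h_acc / n
    delta4 = (n * delta1 + (m_bound - 1.0) * sum_rh) / t
    if sum_r > 0.0:
        delta6 = (n / t) * delta1 + ((t - n) / t) * (sum_rh / sum_r)
    else:
        delta6 = delta1  # no rejected value to recycle
    return delta1, delta4, delta6, t


if __name__ == "__main__":
    print(gamma_ar_estimators(lambda x: x / (1.0 + x)))
```

The three returned values can then be compared along the lines of the convergence assessment described above; for $h_3(x) = x/(1+x)$ they should all be close to the value 0.7497 quoted in the example.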



3.4 Laplace Approximations 

As an alternative to simulation of integrals, we can also attempt analytic ap- 
proximations. One of the oldest and most useful approximations is the integral 
$m$              100                        1000                       5000
         $\delta_1$  $\delta_4$  $\delta_6$    $\delta_1$  $\delta_4$  $\delta_6$    $\delta_1$  $\delta_4$  $\delta_6$
$h_1$    87.3    55.9    64.2      36.5    0.044   0.047     2.02    0.54    0.64
$h_2$    1.6     3.3     4.4       4.0     0.00    0.00      0.17    0.00    0.00
$h_3$    6.84    0.11    0.76      4.73    0.00    0.00      0.38    0.02    0.00



Table 3.5. Comparison of the performances of the Monte Carlo estimator ($\delta_1$)
with two importance sampling estimators ($\delta_4$ and $\delta_6$) under squared error loss after
$m$ iterations for $\alpha = 3.7$ and $\beta = 1$. The squared error loss is multiplied by 10^ for
the estimation of $\mathbb{E}[h_2(X)]$ and by 10^ for the estimation of $\mathbb{E}[h_3(X)]$. The squared
errors are actually the difference from the theoretical values (99.123, 5.3185, and
0.7497, respectively) and the three estimators are based on the same unique sample,
which explains the lack of monotonicity (in $m$) of the errors. (Source: Casella and
Robert 1998.)



Laplace approximation. It is based on the following argument: Suppose that
we are interested in evaluating the integral

(3.19)    $\int_A f(x|\theta)\, dx$

for a fixed value of $\theta$. (The function $f$ needs to be non-negative and integrable;
see Tierney and Kadane 1986 and Tierney et al. 1989 for extensions.) Write
$f(x|\theta) = \exp\{n h(x|\theta)\}$, where $n$ is the sample size or another parameter which
can go to infinity, and use a Taylor series expansion of $h(x|\theta)$ about a point
$x_0$ to obtain

(3.20)    $h(x|\theta) \approx h(x_0|\theta) + (x - x_0)\, h'(x_0|\theta) + \frac{(x - x_0)^2}{2!}\, h''(x_0|\theta) + \frac{(x - x_0)^3}{3!}\, h'''(x_0|\theta) + R_n(x)$,



where we write

$h'(x_0|\theta) = \left. \frac{\partial h(x|\theta)}{\partial x} \right|_{x = x_0}$

and similarly for the other terms, while the remainder $R_n(x)$ satisfies

$\lim_{x \to x_0} R_n(x) / (x - x_0)^3 = 0$.



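Now choose $x_0 = \hat{x}_\theta$, the value that satisfies $h'(\hat{x}_\theta|\theta) = 0$ and maximizes
$h(x|\theta)$ for the given value of $\theta$. Then, the linear term in (3.20) is zero and we
have the approximation

$\int_A e^{n h(x|\theta)}\, dx \approx e^{n h(\hat{x}_\theta|\theta)} \int_A e^{n \frac{(x - \hat{x}_\theta)^2}{2} h''(\hat{x}_\theta|\theta)}\, e^{n \frac{(x - \hat{x}_\theta)^3}{3!} h'''(\hat{x}_\theta|\theta)}\, dx$,

which is valid within a neighborhood of $\hat{x}_\theta$. (See Schervish 1995, Section 7.4.3,
for detailed conditions.) Note the importance of choosing the point $x_0$ to be
a maximum.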
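The cubic term in the exponent is now expanded in a series around $\hat{x}_\theta$.
Recall that the second order Taylor expansion of $e^y$ around 0 is $e^y \approx 1 + y + y^2/2!$,
and hence expanding $\exp\{n (x - \hat{x}_\theta)^3 h'''(\hat{x}_\theta|\theta)/3!\}$ around $\hat{x}_\theta$, we obtain
the approximation

$1 + n\, \frac{(x - \hat{x}_\theta)^3}{3!}\, h'''(\hat{x}_\theta|\theta) + n^2\, \frac{(x - \hat{x}_\theta)^6}{2!\,(3!)^2}\, [h'''(\hat{x}_\theta|\theta)]^2$,

and thus

(3.21)    $\int_A e^{n h(x|\theta)}\, dx \approx e^{n h(\hat{x}_\theta|\theta)} \int_A e^{n \frac{(x - \hat{x}_\theta)^2}{2} h''(\hat{x}_\theta|\theta)} \times \left[ 1 + n\, \frac{(x - \hat{x}_\theta)^3}{3!}\, h'''(\hat{x}_\theta|\theta) + n^2\, \frac{(x - \hat{x}_\theta)^6}{2!\,(3!)^2}\, [h'''(\hat{x}_\theta|\theta)]^2 \right] dx + R_n$,

where $R_n$ again denotes a remainder term.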

Excluding $R_n$, we call the integral approximation in (3.21) a first-order
approximation if it includes only the first term in the right-hand side, a
second-order approximation if it includes the first two terms, and a third-order
approximation if it includes all three terms.

Since the above integrand is the kernel of a normal density with mean $\hat{x}_\theta$
and variance $-1/n h''(\hat{x}_\theta|\theta)$, we can evaluate these expressions further. More
precisely, letting $\Phi(\cdot)$ denote the standard normal cdf, and taking $A = [a, b]$,
we can evaluate the integral in the first-order approximation to obtain (see
Problem 3.25)

(3.22)    $\int_a^b e^{n h(x|\theta)}\, dx \approx e^{n h(\hat{x}_\theta|\theta)} \sqrt{\frac{2\pi}{-n h''(\hat{x}_\theta|\theta)}} \times \left\{ \Phi\left[ \sqrt{-n h''(\hat{x}_\theta|\theta)}\, (b - \hat{x}_\theta) \right] - \Phi\left[ \sqrt{-n h''(\hat{x}_\theta|\theta)}\, (a - \hat{x}_\theta) \right] \right\}$.



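Example 3.16. Gamma approximation. As a simple illustration of the
Laplace approximation, consider estimating a Gamma $\mathcal{G}a(\alpha, 1/\beta)$ integral,
say

(3.23)    $\int_a^b \frac{x^{\alpha - 1}}{\Gamma(\alpha) \beta^{\alpha}}\, e^{-x/\beta}\, dx$.

Here we have $h(x) = -\frac{x}{\beta} + (\alpha - 1) \log(x)$ with second order Taylor expansion
(around a point $x_0$)

$h(x) \approx h(x_0) + h'(x_0)(x - x_0) + h''(x_0)\, \frac{(x - x_0)^2}{2!}$
$\quad = -\frac{x_0}{\beta} + (\alpha - 1) \log(x_0) + \left( \frac{\alpha - 1}{x_0} - \frac{1}{\beta} \right)(x - x_0) - \frac{\alpha - 1}{2 x_0^2}\, (x - x_0)^2$.

Choosing $x_0 = \hat{x}_\theta = (\alpha - 1)\beta$ (the mode of the density and maximizer of $h$)
yields

$h(x) \approx -\frac{\hat{x}_\theta}{\beta} + (\alpha - 1) \log(\hat{x}_\theta) - \frac{\alpha - 1}{2 \hat{x}_\theta^2}\, (x - \hat{x}_\theta)^2$.

Now substituting into (3.22) yields the Laplace approximation

$\int_a^b \frac{x^{\alpha - 1}}{\Gamma(\alpha) \beta^{\alpha}}\, e^{-x/\beta}\, dx \approx \frac{\hat{x}_\theta^{\alpha - 1} e^{-\hat{x}_\theta/\beta}}{\Gamma(\alpha) \beta^{\alpha}} \sqrt{\frac{2\pi \hat{x}_\theta^2}{\alpha - 1}} \times \left\{ \Phi\left( \sqrt{\frac{\alpha - 1}{\hat{x}_\theta^2}}\, (b - \hat{x}_\theta) \right) - \Phi\left( \sqrt{\frac{\alpha - 1}{\hat{x}_\theta^2}}\, (a - \hat{x}_\theta) \right) \right\}$.

For $\alpha = 5$ and $\beta = 2$, $\hat{x}_\theta = 8$, and the approximation will be best in that
area. In Table 3.6 we see that although the approximation is reasonable in
the central region of the density, it becomes quite unacceptable in the tails. ||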



Interval            Approximation    Exact
(7, 9)              0.193351         0.193341
(6, 10)             0.375046         0.37477
(2, 14)             0.848559         0.823349
(15.987, $\infty$)  0.0224544        0.100005

Table 3.6. Laplace approximation of a Gamma integral for $\alpha = 5$ and $\beta = 2$.
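
The entries of Table 3.6 can be recomputed with the short sketch below, which evaluates the first-order approximation (3.22) for the Gamma integral (3.23) and compares it with the exact probability. The function names are hypothetical; the exact value uses the closed-form Erlang survival function, so this sketch assumes an integer shape $\alpha$ (as in the table, where $\alpha = 5$).

```python
import math


def std_normal_cdf(z):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def laplace_gamma(a, b, alpha=5.0, beta=2.0):
    """First-order Laplace approximation of the integral (3.23) over (a, b)."""
    x_hat = (alpha - 1.0) * beta             # mode of the density
    sd = x_hat / math.sqrt(alpha - 1.0)      # sqrt(-1 / h''(x_hat))
    log_peak = ((alpha - 1.0) * math.log(x_hat) - x_hat / beta
                - math.lgamma(alpha) - alpha * math.log(beta))
    scale = math.exp(log_peak) * sd * math.sqrt(2.0 * math.pi)
    return scale * (std_normal_cdf((b - x_hat) / sd)
                    - std_normal_cdf((a - x_hat) / sd))


def erlang_sf(x, k, beta):
    """P(X > x) for a Gamma with integer shape k and scale beta."""
    if math.isinf(x):
        return 0.0
    u = x / beta
    return math.exp(-u) * sum(u ** j / math.factorial(j) for j in range(k))


def exact_gamma(a, b, k=5, beta=2.0):
    """Exact value of the integral (3.23) over (a, b) for integer shape k."""
    return erlang_sf(a, k, beta) - erlang_sf(b, k, beta)


if __name__ == "__main__":
    for a, b in [(7, 9), (6, 10), (2, 14), (15.987, math.inf)]:
        print((a, b), laplace_gamma(a, b), exact_gamma(a, b))
```

Passing `math.inf` as the upper bound handles the tail interval, since the normal cdf evaluates to 1 there.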



Thus, we see both the usefulness and the limits of the Laplace approxima- 
tion. In problems where Monte Carlo calculations are prohibitive because of 
computing time, the Laplace approximation can be useful as a guide to the 
solution of the problem. Also, the corresponding Taylor series can be used as 
a proposal density, which is particularly useful in problems where no obvious 
proposal exists. (See Example 7.12 for a similar situation.) 



3.5 Problems 



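3.1 For the normal-Cauchy Bayes estimator

$\delta(x) = \int_{-\infty}^{\infty} \frac{\theta}{1 + \theta^2}\, e^{-(x - \theta)^2/2}\, d\theta \Big/ \int_{-\infty}^{\infty} \frac{1}{1 + \theta^2}\, e^{-(x - \theta)^2/2}\, d\theta$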



(a) Plot the integrand and use Monte Carlo integration to calculate the integral. 

(b) Monitor the convergence with the standard error of the estimate. Obtain 
three digits of accuracy with probability .95. 

3.2 (Continuation of Problem 3.1) 

(a) Use the Accept-Reject algorithm, with a Cauchy candidate, to generate a 
sample from the posterior distribution and calculate the estimator. 
(b) Design a computer experiment to compare Monte Carlo error when using
(i) the same random variables $\theta_i$ in numerator and denominator, or (ii)
different random variables.

3.3 (a) For a standard normal random variable $Z$, calculate $P(Z > 2.5)$ using
Monte Carlo sums based on indicator functions. How many simulated random
variables are needed to obtain three digits of accuracy?

(b) Using Monte Carlo sums verify that if $X \sim \mathcal{G}a(1,1)$, $P(X > 5.3) \approx .005$.
Find the exact .995 cutoff to three digits of accuracy.

3.4 (a) If $X \sim \mathcal{N}(0, \sigma^2)$, show that

$\mathbb{E}[e^{-X^2}] = \frac{1}{\sqrt{2\sigma^2 + 1}}$.

(b) Generalize to the case $X \sim \mathcal{N}(\mu, \sigma^2)$.

3.5 Referring to Example 3.6: 

(a) Verify the maximum of the likelihood ratio statistic. 

(b) Generate 5000 random variables according to (3.7), recreating the left panel
of Figure 3.2. Compare this distribution to a null distribution where we fix
null values of $p_1$ and $p_2$, for example, $(p_1, p_2) = (.25, .75)$. For a range of
values of $(p_1, p_2)$, compare the histograms both with the one from (3.7) and
the $\chi^2_1$ density. What can you conclude?

3.6 An alternate analysis to that of Example 3.6 is to treat the contingency table 
as two binomial distributions, one for the patients receiving surgery and one for 
those receiving radiation. Then the test of hypothesis becomes a test of equality 
of the two binomial parameters. Repeat the analysis of the data in Table 3.2 
under the assumption of two binomials. Compare the results to those of Example 
3.6. 

3.7 A famous medical experiment was conducted by Joseph Lister in the late 1800s 
to examine the relationship between the use of a disinfectant, carbolic acid, and 
surgical success rates. The data are 



               Disinfectant
               Yes     No
Success        34      19
Failure        6       16



Using the techniques of Example 3.6, analyze these data to examine the associ- 
ation between disinfectant and surgical success rates. Use both the multinomial 
model and the two-binomial model. 

3.8 Referring to Example 3.3, we calculate the expected value of S'^{x) from the pos- 
terior distribution 7t{6\x) oc ||^||~^ exp{ — ||x — ^||^/2} , arising from a normal 
likelihood and noninformative prior ||^||“^ (see Example 1.12). 

(a) Show that if the quadratic loss of Example 3.3 is normalized by 1/(2||^|P + 
p), the resulting Bayes estimator is 



S^{x) = E^ 



II^IP 

2||^P+p 



X, A 






1 

.2IW+P 



x,A . 



(b) Simulation of the posterior can be done by representing 6 in polar co- 
ordinates (p, (pi,(p 2 ) (p > 0, (pi G [-7t/2, 7t/2], (P2 G [— 7t/2, 7t/ 2]), with 
0 = (pcos(pi, psinpi cos(p 2 , psin^pi sin(p 2 ). If we denote ^ = 9/p, which 
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depends only on (<^1,(^2), show that ^ M{x • 1) and then inte- 

gration of p then leads to 

7r(v3i,<^2|x) oc exp{(a: • 4)^/2} sin(vJi), 

where x • ^ = xi cos((/?i) + X2 sin((^i) cos((/?2) + X3 sin((^i) sin((^2). 

(c) Show how to simulate from (/?2|x) using an Accept-Reject algorithm 

with instrumental function sin((/?i) exp{||x||^/2}. 

(d) For p = 3 and x = (0.1, 1.2, —0.7), demonstrate the convergence of the 
algorithm. Make plots of the iterations of the integral and its standard 
error. 

3.9 For the situation of Example 3.10, recreate Figure 3.4 using the following sim- 
ulation strategies with a sample size of 10, 000 points: 

(a) For each value of $\lambda$, simulate a sample from the $\mathcal{E}xp(1/\lambda)$ distribution and
a separate sample from the log-normal $\mathcal{LN}(0, 2 \log \lambda)$ distribution. Plot the
resulting risk functions.

(b) For each value of $\lambda$, simulate a sample from the $\mathcal{E}xp(1/\lambda)$ distribution and
then transform it into a sample from the $\mathcal{LN}(0, 2 \log \lambda)$ distribution. Plot
the resulting risk functions.

(c) Simulate a sample from the $\mathcal{E}xp(1)$ distribution. For each value of $\lambda$,
transform it into a sample from $\mathcal{E}xp(1/\lambda)$, and then transform it into a sample
from the $\mathcal{LN}(0, 2 \log \lambda)$ distribution. Plot the resulting risk functions.

(d) Compare and comment on the accuracy of the plots. 

3.10 Compare (in a simulation experiment) the performances of the regular Monte
Carlo estimator of

$\int_1^2 \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx = \Phi(2) - \Phi(1)$

with those of an estimator based on an optimal choice of instrumental
distribution (see (3.11)).

3.11 In the setup of Example 3.10, give the two first moments of the log-normal
distribution $\mathcal{LN}(\mu, \sigma^2)$.

3.12 In the setup of Example 3.13, examine whether or not the different estimators
of the expectations $\mathbb{E}_f[h_i(X)]$ have finite variances.

3.13 Establish the equality (3.18) using the representation $b = \beta a/\alpha$.

3.14 (O Ruanaidh and Fitzgerald 1996) For simulating random variables from the 
density f{x) = exp{— yx}[sin(x)]^, 0 < x < 00, compare the following choices 
of instrumental densities: 

gi(x) = g 2 {x) = ^ sech^(a:/v^), 

Ssi^) = ^ i+i'2/4) 94{x) = -^e . 

(a) For M = 100, 1000, and 10, 000, compare the standard deviations of the 
estimates based on simulating M random variables. 

(b) For each of the instrumental densities, estimate the size of M needed to 
obtain three digits of accuracy in estimating E/A. 

3.15 Use the techniques of Example 3.11 to redo Problem 3.3. Compare the number 
of variables needed to obtain three digits of accuracy with importance sampling 
to the answers obtained from Problem 3.3. 

3.16 Referring to Example 3.11: 
(a) Show that to simulate $Y \sim \mathcal{TE}(a, 1)$, an exponential distribution left
truncated at $a$, we can simulate $X \sim \mathcal{E}xp(1)$ and take $Y = a + X$.

(b) Use this method to calculate the probability that a xi random variable is 
greater that 25, and that a ts random variable is greater than 50. 

(c) Explore the gain in efficiency from this method. Take a = 4.5 in part (a) 
and run an experiment to determine how many random variables would be 
needed to calculate P(Z > 4.5) to the same accuracy obtained from using 
100 random variables in an importance sampler. 

3.17 In this chapter, the importance sampling method is developed for an iid sample
$(Y_1, \ldots, Y_n)$ from $g$.

(a) Show that the importance sampling estimator is still unbiased if the $Y_i$'s
are correlated while being marginally distributed from $g$.

(b) Show that the importance sampling estimator can be extended to the case
when $Y_i$ is generated from a conditional distribution $q(y_i|Y_{i-1})$.

(c) Implement a scheme based on an iid sample $(Y_1, Y_3, \ldots, Y_{2n-1})$ and a
secondary sample $(Y_2, Y_4, \ldots, Y_{2n})$ such that $Y_{2i} \sim q(y_{2i}|Y_{2i-1})$. Show that the
covariance

$\mathrm{cov}\left( h(Y_{2i-1})\, \frac{f(Y_{2i-1})}{g(Y_{2i-1})},\ h(Y_{2i})\, \frac{f(Y_{2i})}{q(Y_{2i}|Y_{2i-1})} \right)$

is null. Generalize.

3.18 For a sample (Yi, . . . , Yh) from g, the weights coi are defined as 



^ ^ f{Yi)/g{Yi) 



Show that the following algorithm (Rubin 1987) produces a sample from / such 
that the empirical average 

1 ^ 

M 2: 

m=l 

is asymptotically equivalent to the importance sampling estimator based on 
(Yi,...,Y^): 

For m = 1, . . . , M, 

take Xm = Yi with probability uji 

{Note: This is the SIR algorithm.) 

3.19 (Smith and Gelfand 1992) Show that, when evaluating an integral based on a
posterior distribution

$\pi(\theta|x) \propto \pi(\theta)\, \ell(\theta|x)$,

where $\pi$ is the prior distribution and $\ell$ the likelihood function, the prior
distribution can always be used as instrumental distribution (see Problem 2.29).

(a) Show that the variance is finite when the likelihood is bounded.

(b) Compare with choosing $\ell(\theta|x)$ as instrumental distribution when the
likelihood is proportional to a density. (Hint: Consider the case of exponential
families.)

(c) Discuss the drawbacks of this (these) choice(s) in specific settings. 

(d) Show that a mixture between both instrumental distributions can ease some 
of the drawbacks. 

3.20 In the setting of Example 3.13, show that the variance of the importance
sampling estimator associated with an importance function $g$ and the integrand
$h(x) = \sqrt{x/(1 - x)}$ is infinite for all $g$'s such that $g(1) < \infty$.
3.21 Monte Carlo marginalization is a technique for calculating a marginal density
when simulating from a joint density. Let $(X_i, Y_i) \sim f_{XY}(x, y)$, independent,
and the corresponding marginal distribution $f_X(x) = \int f_{XY}(x, y)\, dy$.

(a) Let $w(x)$ be an arbitrary density. Show that

$\lim_n \frac{1}{n} \sum_{i=1}^{n} \frac{f_{XY}(x^*, y_i)\, w(x_i)}{f_{XY}(x_i, y_i)} = \int\!\!\int \frac{f_{XY}(x^*, y)\, w(x)}{f_{XY}(x, y)}\, f_{XY}(x, y)\, dx\, dy = f_X(x^*)$

and so we have a Monte Carlo estimate of $f_X$, the marginal distribution of
$X$, from only knowing the form of the joint distribution.

(b) Let $X|Y = y \sim \mathcal{G}a(y, 1)$ and $Y \sim \mathcal{E}xp(1)$. Use the technique of part (a) to
plot the marginal density of $X$. Compare it to the exact marginal.

(c) Choosing $w(x) = f_{X|Y}(x|y)$ works to produce the marginal distribution,
and it is optimal. In the spirit of Theorem 3.12, can you prove this?

3.22 Given a real importance sample $X_1, \ldots, X_n$ with importance function $g$ and
target density $f$,

(a) show that the sum of the weights Wi = f{Xi)/g(Xi) is only equal to 1 m 
expectation and deduce that the weights need to be renormalized even when 
both densities have know normalizing constants. 

(b) Assuming that the weights $\omega_i$ have been renormalized to sum to one, we
sample, with replacement, $n$ points $\tilde{X}_j$ from the $X_i$'s using those weights.
Show that the $\tilde{X}_j$'s satisfy

$\mathbb{E}\left[ \frac{1}{n} \sum_{j=1}^{n} h(\tilde{X}_j) \right] = \mathbb{E}\left[ \sum_{i=1}^{n} \omega_i\, h(X_i) \right]$.

(c) Deduce that, if the above formula is satisfied for $\omega_i = f(X_i)/g(X_i)$ instead,
the empirical distribution associated with the $\tilde{X}_j$'s is unbiased.

3.23 (Evans and Swartz 1995) Devise and implement a simulation experiment to 
approximate the probability P(Z 6 (0, oo)®) when Z ~ VeCO, X) and 



= diag(0, 1, 2, 3, 4, 5) + e • e‘. 



with e* = (1, 1, 1, 1, 1, 1): 

(a) when using the transform of a ^6(0, h) random variables; 

(b) when using the Choleski decomposition of i?; 

(c) when using a distribution restricted to (0, oo)® and importance sampling. 

3.24 Using the facts 






2,1 

y +- 



-cy'^/2 



' 5 , 15y 




derive expressions similar to (3.22) for the second- and third-order approxima- 
tions (see also Problem 5.6). 

3.25 By evaluating the normal integral for the first order approximation from (3.21), 
establish (3.22). 

3.26 Referring to Example 3.16, derive the Laplace approximation for the Gamma 
density and reproduce Table 3.6. 
3.27 (Gelfand and Dey 1994) Consider a density function $f(x|\theta)$ and a prior
distribution $\pi(\theta)$ such that the marginal $m(x) = \int f(x|\theta) \pi(\theta)\, d\theta$ is finite a.e. The
marginal density is of use in the comparison of models since it appears in the
Bayes factor (see Section 1.3).

(a) Give a Laplace approximation of m and derive the corresponding approxi- 
mation of the Bayes factor. (See Tierney et al. 1989 for details.) 

(b) Give the general shape of an importance sampling approximation of m. 

(c) Detail this approximation when the importance function is the posterior 
distribution and when the normalizing constant is unknown. 

(d) Show that for a proper density r, 






and deduce that when the 0*’s are generated from the posterior, 

. 11 ^ r(er) I 

mi 



K t=i 



^ f{x\ei)7r{e:) 
is another importance sampling estimator of m. 



3.28 (Berger et al. 1998) For E a pxp positive-definite symmetric matrix, consider 
the distribution 



7r{6) oc 



exp (-(6> - ^)‘i: \0-fj,)/2) 



(a) Show that the distribution is well defined; that is, that 



/ 

JRP 



exp (-(61 - ^)‘r \0-fj.)/2) 



d6 < oo. 



(b) Show that an importance sampling implementation based on the normal 

instrumental distribution E) is not satisfactory from both theoretical 

and practical points of view. 

(c) Examine the alternative based on a Gamma distribution Qa(a^ (3) on p = 
\\9\f and a uniform distribution on the angles. 

Note: Priors such as these have been used to derive Bayes minimax estimators 
of a multivariate normal mean. See Lehmann and Casella (1998). 

3.29 From the Accept-Reject Algorithm we get a sequence Fi,l 2 , • • • of indepen- 
dent random variables generated from g along with a corresponding sequence 
t/i, t/ 2 , • . . of uniform random variables. For a fixed sample size t (i.e. for a fixed 
number of accepted random variables), the number of generated Ti’s is a random 
integer N. 

(a) Show that the joint distribution of (AT, Ti, . . . , Y}\r, C/i, . . . , Un) is given by 



P[N = n, Fi < t/i, . . . ,Tn < 2/n, Ul <Ui,...,Un < Un) 

/ Vn ryi rvn-i 

g{tn){Un A Wn)dtn I ’’f p(^l) • • • l) 

-00 J —00 J —00 



t-l 



' —00 
n — 1 



E hk 

(nr-- 3=t 



■ Wi.)~^dti • • -dtn-l, 



X 
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where Wi = f {Vi) / M g{yi) and the sum is over all subsets of {1, . . . , n — 1} 
of size t — 1. 

(b) There is also interest in the joint distribution of {Yi,Ui)\N = n, for any 
z = l,...,n— 1, aswe will see in Problem 4.17. Since this distribution is 
the same for each z, we can just derive it for (Ti, Ui). (Recall that Yn ~ /.) 
Show that 



P{N = n, Fi < y, Ui < ui) 

t-i 



^ fn-l\ 

“ \t-l J \m) V m) 

t — 1 , .A l\ n — t 

-{wi A ui) 1 - — H 

n-1 V M J n-1 



(«i - wi)+ (]^ J 



(c) Show that part (b) yields the negative binomial marginal distribution of N, 



P(N = n) = 



n — 1 
t - 1 



1 - 



the marginal distribution of Yi, m{y)^ 



m{y) = 



n — 1 n — 1 1 — P 



and 



P{Ui <w{y)\Yi^y,N = n)^ 



g{y)w{y)M^ 

m{y) 



3.30 If (Yi, . . . , Yn) is the sample produced by an Accept-Reject method based on 
(/, ^), where M sup(//p), (Xi, . . . , Xt) denotes the accepted subsample and 
(Zi, . . . , ZN-t) the rejected subsample. 

(a) Show that both 



(^2 



1 

N -t 



N-t 



E 



(M-l)/(Zi) 
Mg{Zi) - f{Zi) 



and 

i=l 



are unbiased estimators of 7 = Ef[h{X)] (when N > t). 

(b) Show that and S 2 are independent. 

(c) Determine the optimal weight (3* in S 3 = pSi-{-(l—/ 3 )S 2 in terms of variance. 
{Note: f3 may depend on N but not on (Yi, . . . , Yn)-) 

3.31 Given a sample Zi, . . . , Zn+t produced by an Accept-Reject algorithm to ac- 
cept n values, based on (/, M), show that the distribution of a rejected variable 

is 



U_j^ 
V Mg{z) 



g{z) 



g{z) - pf{z) 
1 - p 



where p = 1/M, that the marginal distribution of Z^ (i < n + 1) is 



^-1 ./ X , t 9{z) - Pfjz) 

n + t-r ^ ^ n + t-1 1-/9 



fm{z) — 
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and that the joint distribution of (Zi, Zj) (1 < z ^ j < n + t) is 
(n — l)(n — 2) 



(n + t — l)(n + t — 2) 

, (n - 1)< 



f(zi)f(zj) 



fiZi) 



gjzj) - pf{zj) g{zi) - pfjzi) 



(n + t- l)(n + t-2) 1-p l-p 

n{n - 1) g{zi) - pf{zi) g{zj) - pf{zj) 



{n + t — l)(n 4- 1 — 2) 



l-p 



l-p 




3.32 (Continuation of Problem 3.31) If Zi, . . . , Zn+t is the sample produced by an 
Accept-Reject algorithm to generate n values, based on show that 

the Zi’s are negatively correlated in the sense that for every square integrable 
function /i, 



cov{h{Yi),h{Yj)) = -Ej[/i]^EAT 



(n — l)2(n — 2) 



= ~^g[hf{p\Fi{t - 1, t - 1; t - 1; 1 - p) - p^}, 



where 2 -Fi(a, 5; c; z) is the confluent hypergeometric function (see Abramowitz 
and Stegun 1964 or Problem 1.38). 

3.33 Given an Accept-Reject algorithm based on (/, p,p), we denote by 



Kvj) = 



(1 -p)/(yj) 

g{yj) - pfivj) 



the importance sampling weight of the rejected variables (li , . . . ,Yt), and by 
(Xi, . . . , Xn) the accepted variables. 

(a) Show that the estimator 






n 

n-\-t 






t 

n + t 






with 



and 



6o = jj2b{Yj)h{Yj) 

j = l 

^ 1 ^ 

n 



does not uniformly dominate 

(b) Show that 



{Hint: Consider the constant functions.) 



^2w 



n + t 



^An + 



n + t 



^ h{Y^)h{Y^)/Y, b{Yj) 



j = l 



J = 1 



is asymptotically equivalent to in terms of bias and variance, 

(c) Deduce that S 2 w asymptotically dominates if (4.20) holds. 

3.34 For the Accept-Reject algorithm of Section 3.3.3:
(a) Show that conditionally on t, the joint density of $(Y_1, \ldots, Y_t)$ is indeed

$$\prod_{j=1}^{t-1} \left( \frac{M g(y_j) - f(y_j)}{M-1} \right) f(y_t)$$

and the expectation of $\delta_2$ of (3.15) is given by

$$E\left[ \frac{t-1}{t} \left\{ \frac{M}{M-1}\, E_f[h(X)] - \frac{1}{M-1}\, E_f\left[ h(X) \frac{f(X)}{g(X)} \right] \right\} + \frac{1}{t}\, E_f\left[ h(X) \frac{f(X)}{g(X)} \right] \right].$$






(b) If we denote the acceptance probability of the Accept-Reject algorithm by
$\rho = 1/M$ and assume $E_f[h(X)] = 0$, show that the bias of $\delta_2$ is

$$\left( \frac{1}{1-\rho}\, E[t^{-1}] - \frac{\rho}{1-\rho} \right) E_f\left[ h(X) \frac{f(X)}{g(X)} \right].$$



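(c) Establish that for $t \sim \mathcal{G}eo(\rho)$, $E[t^{-1}] = -\rho \log(\rho)/(1-\rho)$, and that the bias
of $\delta_2$ can be written as

$$-\frac{\rho}{1-\rho} \left( 1 + \frac{\log(\rho)}{1-\rho} \right) E_f\left[ h(X) \frac{f(X)}{g(X)} \right].$$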



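(d) Assuming that $E_f[h(X)] = 0$, show that the variance of $\delta_2$ is

$$E\left[ \frac{t-1}{t^2} \right] \frac{1}{1-\rho}\, E_f\left[ h^2(X) \frac{f(X)}{g(X)} \right]
+ E\left[ \frac{1}{t^2} \left( 1 - \frac{\rho(t-1)}{1-\rho} \right) \right] \mathrm{var}_f\big( h(X) f(X)/g(X) \big).$$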



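3.35 Using the information from Note 3.6.1, for a binomial experiment $X_n \sim \mathcal{B}(n, p)$
with $p = 10^{-6}$, determine the minimum sample size n so that

$$P\left( \left| \frac{X_n}{n} - p \right| \le \epsilon p \right) > .95$$

when $\epsilon = 10^{-1}$, $10^{-2}$, and $10^{-3}$.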

3.36 When random variables $Y_i$ are generated from (3.25), show that the ratio
$\prod_{i=1}^n f(Y_i)/t(Y_i)$ is distributed as $\lambda(\theta_0)^n \exp(-n\theta_0 J)$. Deduce that (3.26) is unbiased.

3.37 Starting with a density f of interest, we create the exponential family

$$\mathcal{F} = \{ f(\cdot|\tau);\ f(x|\tau) = \exp[\tau x - K(\tau)]\, f(x) \},$$

where $K(\tau)$ is the cumulant generating function of f given in Section 3.6.2. It
immediately follows that if $X_1, X_2, \ldots, X_n$ are iid from $f(x|\tau)$, the density of
$\bar{X}$ is

(3.24) $f_{\bar X}(x|\tau) = \exp\{n[\tau x - K(\tau)]\}\, f_{\bar X}(x),$

where $f_{\bar X}(x)$ is the density of the average of an iid sample from f.
(a) Show that $f(x|\tau)$ is a density.
(b) If $X_1, X_2, \ldots, X_n$ is an iid sample from $f(x)$, and $f_{\bar X}(x)$ is the density of
the sample mean, show that $\int e^{n\tau x} f_{\bar X}(x)\,dx = e^{nK(\tau)}$ and hence $f_{\bar X}(x|\tau)$
of (3.24) is a density.

(c) Show that the mgf of $f_{\bar X}(x|\tau)$ is $e^{n[K(\tau + t/n) - K(\tau)]}$.

(d) In (3.29), for each x we choose $\tau$ so that $\mu_\tau$, the mean of $f_{\bar X}(x|\tau)$, satisfies
$\mu_\tau = x$. Show that this value of $\tau$ is the solution to the equation $K'(\tau) = x$.

3.38 For the situation of Example 3.18: 

(a) Verify the mgf in (3.32). 

(b) Show that the solution to the saddlepoint equation is given by (3.33). 

(c) Plot the saddlepoint density for p = 7 and n = 1,5,20. Compare your 
results to the exact density. 

3.6 Notes 

3.6.1 Large Deviations Techniques 

When we introduced importance sampling methods in Section 3.3, we showed in
Example 3.8 that alternatives to direct sampling were preferable when sampling from
the tails of a distribution f. When the event A is particularly rare, say $P(A) \le 10^{-6}$,
methods like importance sampling are needed to get an acceptable approximation
(see Problem 3.35). Since the optimal choice given in Theorem 3.12 is formal, in
the sense that it involves the unknown constant $\mathfrak{I}$, more practical choices have been
proposed in the literature. In particular, Bucklew (1990) indicates how the theory
of large deviations may help in devising proposal distributions for this purpose.
Briefly, the theory of large deviations is concerned with the approximation of
tail probabilities $P(|\bar X_n - \mu| > \epsilon)$ when $\bar X_n = (X_1 + \cdots + X_n)/n$ is a mean of iid
random variables, n goes to infinity, and $\epsilon$ is large. (When $\epsilon$ is small, the normal
approximation based on the Central Limit Theorem works well enough.)

If $M(\theta) = E[\exp(\theta X_1)]$ is the moment generating function of $X_1$ and we define
$I(x) = \sup_\theta \{\theta x - \log M(\theta)\}$, the large deviation approximation is

$$\frac{1}{n} \log P(S_n \in F) \approx -\inf_{F} I(x).$$

This result is sometimes called Cramér’s Theorem, and a simulation device based on
this result, called twisted simulation, is as follows.

To evaluate

$$\mathfrak{I} = P\left( \frac{1}{n} \sum_{i=1}^n h(X_i) \ge 0 \right)$$

when $E[h(X_1)] < 0$, we use the proposal density

(3.25) $t(x) \propto f(x) \exp\{\theta_0 h(x)\},$

where the parameter $\theta_0$ is chosen such that $\int h(x) f(x) e^{\theta_0 h(x)} dx = 0$. (Note the
similarity with exponential tilting in saddlepoint approximations in Section 3.6.2.)
The corresponding estimate of $\mathfrak{I}$ is then based on blocks (m = 1, . . . , M)

$$J^{(m)} = \frac{1}{n} \sum_{i=1}^n h(Y_i^{(m)}),$$

where the $Y_i^{(m)}$ are iid from t, as follows:

(3.26) $\hat{\mathfrak{I}} = \frac{1}{M} \sum_{m=1}^M \mathbb{I}_{[0,\infty)}(J^{(m)})\, e^{-n\theta_0 J^{(m)}}\, \lambda(\theta_0)^n,$

with $\lambda(\theta) = \int f(x) e^{\theta h(x)} dx$. The fact that (3.26) is unbiased follows from a
regular importance sampling argument (Problem 3.36). Bucklew (1990) argues
that the variance of $\hat{\mathfrak{I}}$ goes to 0 exponentially twice as fast
as that of the regular (direct sampling) estimate.



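Example 3.17. Laplace distribution. Consider $h(x) = x$ and the sampling
distribution $f(x) = \frac{1}{2\alpha} \exp\{-|x - \mu|/\alpha\}$, $\mu < 0$. We then have

$$t(x) \propto \exp\{-|x - \mu|/\alpha + \theta_0 x\},$$

$$\theta_0 = \sqrt{\mu^{-2} + \alpha^{-2}} + \mu^{-1},$$

$$\lambda(\theta_0) = \frac{\mu^2}{2} \exp(-C)\, \alpha^{-2} C^{-1},$$

with $C = \sqrt{1 + \mu^2/\alpha^2} - 1$. A large deviation computation then shows that (Bucklew
1990, p. 139)

$$\lim_n \frac{1}{n} \log(M\, \mathrm{var}\, \hat{\mathfrak{I}}) = 2 \log \lambda(\theta_0),$$

while the standard average $\bar{\mathfrak{I}}$ satisfies

$$\lim_n \frac{1}{n} \log(M\, \mathrm{var}\, \bar{\mathfrak{I}}) = \log \lambda(\theta_0).$$

Obviously, this is not the entire story. Further improvements can be found in
the theory, while the computation of $\theta_0$ and $\lambda(\theta_0)$ and simulation from $t(x)$ may
become quite intricate in realistic setups.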
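
As an illustration of twisted simulation in the Laplace setting of Example 3.17, here is a minimal Python sketch. It assumes $h(x) = x$, takes $\theta_0$ as the root of $\lambda'(\theta) = 0$ for the Laplace moment generating function $\lambda(\theta) = e^{\theta\mu}/(1 - \alpha^2\theta^2)$, and samples the tilted density as a two-sided exponential around $\mu$. The function names (`tilt_parameters`, `sample_tilted`, `twisted_estimate`) are illustrative choices, not taken from the text.

```python
import numpy as np

def tilt_parameters(mu, alpha):
    """Return (theta0, lam) for the Laplace(mu, alpha) density with h(x) = x, mu < 0."""
    theta0 = np.sqrt(mu**-2 + alpha**-2) + 1.0 / mu
    # Laplace moment generating function evaluated at theta0
    lam = np.exp(theta0 * mu) / (1.0 - (alpha * theta0) ** 2)
    return theta0, lam

def sample_tilted(mu, alpha, theta0, size, rng):
    """Draw from t(x) proportional to exp(-|x - mu|/alpha + theta0 * x)."""
    rate_right = 1.0 / alpha - theta0   # exponential decay rate for x > mu
    rate_left = 1.0 / alpha + theta0    # exponential decay rate for x < mu
    p_right = (1.0 / rate_right) / (1.0 / rate_right + 1.0 / rate_left)
    go_right = rng.random(size) < p_right
    e = rng.exponential(size=size)
    return np.where(go_right, mu + e / rate_right, mu - e / rate_left)

def twisted_estimate(mu, alpha, n, M, rng):
    """Estimate P(mean of n Laplace draws >= 0) from M tilted blocks, as in (3.26)."""
    theta0, lam = tilt_parameters(mu, alpha)
    J = sample_tilted(mu, alpha, theta0, (M, n), rng).mean(axis=1)
    weights = np.exp(n * np.log(lam) - n * theta0 * J)   # lam^n * exp(-n theta0 J)
    return np.mean((J >= 0) * weights)

rng = np.random.default_rng(0)
print(twisted_estimate(mu=-1.0, alpha=1.0, n=20, M=10_000, rng=rng))
```

For n = 1 the target probability is available in closed form, $\frac{1}{2} e^{\mu/\alpha}$, which gives a simple sanity check of the weights.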



3.6.2 The Saddlepoint Approximation 

The saddlepoint approximation, in contrast to the Laplace approximation, is mainly 
a technique for approximating a function rather than an integral, although it natu- 
rally leads to an integral approximation. (For introductions to the topic see Goutis 
and Casella 1999, the review papers of Reid 1988, 1991, or the books by Field and 
Ronchetti 1990, Kolassa 1994, or Jensen 1995.) 

Suppose we would like to evaluate

(3.27) $g(\theta) = \int_A f(x|\theta)\, dx$

for a range of values of $\theta$. One interpretation of a saddlepoint approximation is that
for each value of $\theta$ we do a Laplace approximation centered at $\hat{x}_\theta$ (the saddlepoint).^

^ The saddlepoint approximation got its name because its original derivation
(Daniels 1954) used a complex analysis argument, and the point $\hat{x}_\theta$ is a
saddlepoint in the complex plane.
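One way to derive the saddlepoint approximation is to use an Edgeworth
expansion (see Hall 1992 or Reid 1988 for details). As a result of a quite detailed
derivation, we obtain the approximation to the density of $\bar X$ to be

(3.28) $f_{\bar X}(x) = \frac{\sqrt{n}}{\sigma}\, \varphi\left( \frac{x - \mu}{\sigma/\sqrt{n}} \right) \left[ 1 + \frac{\kappa}{6\sqrt{n}} \left\{ \left( \frac{x - \mu}{\sigma/\sqrt{n}} \right)^3 - 3 \left( \frac{x - \mu}{\sigma/\sqrt{n}} \right) \right\} + O(1/n) \right],$

where $\mu$, $\sigma^2$, and $\kappa$ denote the mean, variance, and skewness of the $X_i$'s and $\varphi$ is the standard normal density.
Ignoring the term within braces produces the usual normal approximation, which is
accurate to $O(1/\sqrt{n})$. If we are using (3.28) for values of x near $\mu$, then the value of
the expression in braces is close to zero, and the approximation will then be accurate
to $O(1/n)$. The trick of the saddlepoint approximation is to make this always be the
case.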

To do so, we use a family of densities such that, for each x, we can choose a
density from the family to cancel the term in braces in (3.28). One method of creating
such a family is through a technique known as exponential tilting (see Efron 1981,
Stuart and Ord 1987, Section 11.13, Reid 1988, or Problem 3.37). The result of
the exponential tilt is a family of Edgeworth expansions for $f_{\bar X}(x)$ indexed by a
parameter $\tau$, that is,

(3.29) $f_{\bar X}(x) = \exp\{-n[\tau x - K(\tau)]\}\, \frac{\sqrt{n}}{\sigma_\tau}\, \varphi\left( \frac{x - \mu_\tau}{\sigma_\tau/\sqrt{n}} \right) \left[ 1 + \frac{\kappa_\tau}{6\sqrt{n}} \left\{ \left( \frac{x - \mu_\tau}{\sigma_\tau/\sqrt{n}} \right)^3 - 3 \left( \frac{x - \mu_\tau}{\sigma_\tau/\sqrt{n}} \right) \right\} + O(1/n) \right].$

As the parameter $\tau$ is free for us to choose in (3.29), for each x we choose $\tau = \tau(x)$
so that the mean satisfies $\mu_\tau = x$. This choice cancels the middle term in the
square brackets in (3.29), thereby improving the order of the approximation. If
$K(\tau) = \log(E \exp(\tau X))$ is the cumulant generating function, we can choose $\tau$ so
that $K'(\tau) = x$, which is the saddlepoint equation. Denoting this value by $\hat\tau = \hat\tau(x)$
and noting that $\sigma_\tau^2 = K''(\tau)$, we get the saddlepoint approximation

$$f_{\bar X}(x) = \frac{\sqrt{n}}{\sigma_{\hat\tau}}\, \varphi(0) \exp\{n[K(\hat\tau) - \hat\tau x]\}\, [1 + O(1/n)]$$

(3.30) $\approx \left( \frac{n}{2\pi K''(\hat\tau(x))} \right)^{1/2} \exp\{n[K(\hat\tau(x)) - \hat\tau(x)\, x]\}.$

Example 3.18. Saddlepoint tail area approximation. The noncentral chi
squared density has the rather complex form

(3.31) $f(x|\lambda) = \sum_{k=0}^{\infty} \frac{x^{p/2+k-1} e^{-x/2}}{\Gamma(p/2+k)\, 2^{p/2+k}}\, \frac{\lambda^k e^{-\lambda}}{k!},$

where p is the number of degrees of freedom and $\lambda$ is the noncentrality parameter.
It turns out that calculation of the moment generating function is simple, and it can
be expressed in closed form as

(3.32) $\phi_X(t) = \frac{e^{2\lambda t/(1-2t)}}{(1-2t)^{p/2}}.$

Solving the saddlepoint equation $\partial \log \phi_X(t)/\partial t = x$ yields the saddlepoint

(3.33) $\hat\tau(x) = \frac{-p + 2x - \sqrt{p^2 + 8\lambda x}}{4x}$

and applying (3.30) yields the approximate density (see Hougaard 1988 or Problem
3.38). ||



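The saddlepoint can also be used to approximate the tail area of a distribution.
From (3.30), we have the approximation

$$P(\bar X > a) = \int_a^\infty \left( \frac{n}{2\pi K_X''(\hat\tau(x))} \right)^{1/2} \exp\{n[K_X(\hat\tau(x)) - \hat\tau(x)\, x]\}\, dx$$

$$= \int_{\hat\tau(a)}^{1/2} \left( \frac{n}{2\pi} \right)^{1/2} [K_X''(t)]^{1/2} \exp\{n[K_X(t) - t K_X'(t)]\}\, dt,$$

where we make the transformation $K_X'(t) = x$ and $\hat\tau(a)$ satisfies $K_X'(\hat\tau(a)) = a$. This
transformation was noted by Daniels (1983, 1987) and allows the evaluation of the
integral with only one saddlepoint evaluation.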



Interval        Approximation   Renormalized approximation   Exact

(36.225, ∞)     0.1012          0.0996                       0.10
(40.542, ∞)     0.0505          0.0497                       0.05
(49.333, ∞)     0.0101          0.0099                       0.01

Table 3.7. Saddlepoint approximation of a noncentral chi squared tail probability
for p = 6 and $\lambda$ = 9.



To examine the accuracy of the saddlepoint tail area, we return to the noncentral
chi squared distribution of Example 3.18. Table 3.7 compares the tail areas calculated
by integrating the exact density and using the regular and renormalized saddlepoints.
As can be seen, the accuracy is quite impressive.

The discussion above shows only that the order of the approximation is $O(1/n)$,
not the $O(n^{-3/2})$ that is often claimed. This better error rate is obtained by
renormalizing (3.30) so that it integrates to 1.
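
The comparison behind Table 3.7 can be sketched numerically. The following Python code is an illustration under stated assumptions: it uses the noncentral chi squared density written as a Poisson mixture, the moment generating function $e^{2\lambda t/(1-2t)}/(1-2t)^{p/2}$, the saddlepoint (3.33), and a plain trapezoidal rule on a truncated grid in place of exact integration. The helper names are hypothetical.

```python
import math
import numpy as np

def saddlepoint_density(x, p, lam, n=1):
    """Saddlepoint approximation (3.30) built from the saddlepoint (3.33)."""
    x = np.asarray(x, dtype=float)
    tau = (-p + 2 * x - np.sqrt(p**2 + 8 * lam * x)) / (4 * x)
    u = 1 - 2 * tau
    K = 2 * lam * tau / u - 0.5 * p * np.log(u)      # cumulant generating function
    K2 = 8 * lam / u**3 + 2 * p / u**2               # its second derivative
    return np.sqrt(n / (2 * np.pi * K2)) * np.exp(n * (K - tau * x))

def exact_density(x, p, lam, kmax=150):
    """Poisson mixture form of the density, truncated after kmax terms."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for k in range(kmax):
        a = p / 2 + k
        log_term = ((a - 1) * np.log(x) - x / 2 - math.lgamma(a) - a * math.log(2)
                    + k * math.log(lam) - lam - math.lgamma(k + 1))
        total += np.exp(log_term)
    return total

def trapezoid(y, x):
    """Trapezoidal rule on a one-dimensional grid."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def tail_areas(a, p=6, lam=9, upper=300.0, npts=100_001):
    """Return (saddlepoint, renormalized saddlepoint, exact) estimates of P(X > a)."""
    grid = np.linspace(1e-6, upper, npts)
    sp = saddlepoint_density(grid, p, lam)
    ex = exact_density(grid, p, lam)
    mask = grid >= a
    sp_tail = trapezoid(sp[mask], grid[mask])
    return sp_tail, sp_tail / trapezoid(sp, grid), trapezoid(ex[mask], grid[mask])

print(tail_areas(36.225))
```

The second returned value divides the saddlepoint tail by the total mass of the saddlepoint density, which is the renormalization discussed above.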

Saddlepoint approximations for tail areas have seen much more development 
than given here. For example, the work of Lugannani and Rice (1980) produced a 
very accurate approximation that only requires the evaluation of one saddlepoint and 
no integration. There are other approaches to tail area approximations; for example, 
the work of Barndorff-Nielsen (1991) using ancillary statistics or the Bayes-based ap- 
proximation of DiCiccio and Martin (1993). Wood et al. (1993) give generalizations 
of the Lugannani and Rice formula. 
Controlling Monte Carlo Variance



The others regarded him uncertainly, none of them sure how he had arrived 
at such a conclusion or on how to refute it. 

— Susanna Gregory, A Deadly Brew 



In Chapter 3, the Monte Carlo method was introduced (and discussed) as a 
simulation-based approach to the approximation of complex integrals. There 
has been a considerable body of work in this area and, while not all of it is 
completely relevant for this book, in this chapter we discuss the specifics of 
variance estimation and control. These are fundamental concepts, and we will 
see connections with similar developments in the realm of MCMC algorithms 
that are discussed in Chapters 7-12. 



4.1 Monitoring Variation with the CLT 

In Chapter 3, we mentioned the use of the Central Limit Theorem for assessing
the convergence of a Monte Carlo estimate,

$$\bar h_m = \frac{1}{m} \sum_{i=1}^m h(X_i), \qquad X_i \sim f(x),$$

to the integral of interest

(4.1) $\mathfrak{I} = \int h(x) f(x)\, dx.$

Figure 3.1 (right) was, for example, an illustration of the use of a normal
confidence interval for this assessment. It also shows the limitation of a
straightforward application of the CLT to a sequence $(\bar h_m)$ of estimates that are not
independent. Thus, while a given slice (that is, for a given m) in Figure 3.1
(right) indeed provides an asymptotically valid confidence interval, the
envelope built over iterations and represented in this figure has no overall validity;
that is, another Monte Carlo sequence $(\bar h_m)$ will not stay in this envelope with
probability 0.95. To gather a valid assessment of convergence of Monte Carlo
estimators, we need to either derive the joint distribution of the sequence
$(\bar h_m)$ or recover independence by running several sequences in parallel. The
former is somewhat involved, but we will look at it in Section 4.1.2. The
latter is easier to derive and more widely applicable, but greedy in computing
time. However, this last “property” is a feature we will meet repeatedly in
the book, namely that validation of the assessment of variation is of a higher
order than convergence of the estimator itself: it requires much
more computing time than validation of the pointwise convergence (except in
very special cases like regeneration).

4.1.1 Univariate Monitoring 

In this section we look at monitoring methods that are univariate in nature. 
That is, the bounds placed on the estimate at iteration k depend on the values 
at time k and essentially ignore any correlation structure in the iterates. We 
begin with an example. 

Example 4.1. Monitoring with the CLT. When considering the evaluation
of the integral of $h(x) = [\cos(50x) + \sin(20x)]^2$ over [0,1], Figure 3.1
(right) provides one convergence path with a standard error evaluation. As
can be seen there, the resulting confidence band is moving over iterations in
a rather noncoherent fashion, that is, the band exhibits the same “wiggles”
as the point estimate.^

If, instead, we produce parallel sequences of estimates, we get the output 
summarized in Figure 4.1. The main point of this illustration is that the range 
and the empirical 90% band (derived from the set of estimates at each iteration 
by taking the empirical 5% and 95% quantiles) are much wider than the 95% 
confidence interval predicted by the CLT, where the variance was computed 
by averaging the empirical variances over the parallel sequences. || 

This simple example thus warns even further against the blind use of a 
normal approximation when repeatedly invoked over iterations with depen- 
dent estimators, simply because the normal confidence approximation only 
has a marginal and static validation. Using a band of estimators in parallel is 
obviously more costly but it provides the correct assessment on the variation 
of these estimators. 
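
A minimal Python sketch of this parallel-sequence monitoring follows, for the integral of $[\cos(50x) + \sin(20x)]^2$ over [0,1] with uniform draws. The number of sequences, the run length, and the function name `parallel_monitoring` are illustrative assumptions, and the CLT variance is taken here as the average of the final empirical variances of the sequences.

```python
import numpy as np
from statistics import NormalDist

def h(x):
    return (np.cos(50 * x) + np.sin(20 * x)) ** 2

def parallel_monitoring(n_seq=200, m=2000, seed=0):
    """Running means of n_seq parallel sequences with empirical and CLT bands."""
    rng = np.random.default_rng(seed)
    values = h(rng.random((n_seq, m)))                  # uniform draws on [0, 1]
    steps = np.arange(1, m + 1)
    running = np.cumsum(values, axis=1) / steps         # running mean of each sequence
    lower, upper = np.quantile(running, [0.05, 0.95], axis=0)   # empirical 90% band
    avg_var = values.var(axis=1, ddof=1).mean()         # variance averaged over sequences
    z = NormalDist().inv_cdf(0.975)
    half_width = z * np.sqrt(avg_var / steps)           # CLT 95% half-width at each step
    center = running.mean(axis=0)                       # average of the parallel estimates
    return center, lower, upper, half_width

center, lower, upper, half_width = parallel_monitoring()
print(center[-1], lower[-1], upper[-1], half_width[-1])
```

Plotting `lower` and `upper` against `center ± half_width` gives a picture in the spirit of Figure 4.1; the range of the estimates can be added with `running.min(axis=0)` and `running.max(axis=0)` inside the function.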

Example 4.2. Cauchy prior. For the problem of estimating a normal mean, 
it is sometimes the case that a robust prior is desired (see, for example, Berger 
1985, Section 4.7). A degree of robustness can be achieved with a Cauchy prior, 
so we have the model 

^ This behavior is fairly natural when considering that, for each iteration, the con- 
fidence band is centered at the point estimate. 
Fig. 4.1. Convergence of 1000 parallel sequences of Monte Carlo estimators of the
integral of $h(x) = [\cos(50x) + \sin(20x)]^2$: The straight line is the running average
of 1000 estimates, the dotted line is the empirical 90% band and the dashed line is
the normal approximation 95% confidence interval. The grey shading represents the
range of the entire set of estimates.



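(4.2) $X \sim \mathcal{N}(\theta, 1), \qquad \theta \sim \mathcal{C}(0, 1).$

Under squared error loss, the posterior mean is

(4.3) $\delta^\pi(x) = \int \frac{\theta}{1 + \theta^2}\, e^{-(x-\theta)^2/2}\, d\theta \Big/ \int \frac{1}{1 + \theta^2}\, e^{-(x-\theta)^2/2}\, d\theta.$

From the form of $\delta^\pi(x)$ we see that we can simulate iid variables $\theta_1, \ldots, \theta_m \sim \mathcal{N}(x, 1)$ and calculate

(4.4) $\hat\delta^\pi_m(x) = \sum_{i=1}^m \frac{\theta_i}{1 + \theta_i^2} \Big/ \sum_{i=1}^m \frac{1}{1 + \theta_i^2}.$

The Law of Large Numbers implies that $\hat\delta^\pi_m(x)$ goes to $\delta^\pi(x)$ as m goes to $\infty$,
since both the numerator and the denominator are convergent (in m). ||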
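
The ratio estimator of Example 4.2 takes only a few lines of Python. This is a sketch assuming draws $\theta_i \sim \mathcal{N}(x, 1)$ and weights $1/(1 + \theta_i^2)$; the function name is an illustrative choice.

```python
import numpy as np

def posterior_mean_normal_sampling(x, m, rng):
    """Ratio estimate of the posterior mean under a Cauchy prior, theta_i ~ N(x, 1)."""
    theta = rng.normal(loc=x, scale=1.0, size=m)
    w = 1.0 / (1.0 + theta**2)          # Cauchy prior density, up to a constant
    return np.sum(theta * w) / np.sum(w)

rng = np.random.default_rng(0)
print(posterior_mean_normal_sampling(2.5, 10_000, rng))
```

By symmetry of the model, the estimate for x = 0 should be close to 0, which is a quick check on the code.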



Note the “little problem” associated with this example: computing the
variance of the estimator $\hat\delta^\pi_m(x)$ directly “on the run” gets fairly complicated
because the estimator is a ratio of estimators and the variance of a ratio is not
the ratio of the variances! This is actually a problem of some importance in
that ratios of estimators are very common, from importance sampling (where
the weights are most often unnormalized) to the computation of Bayes factors
in model choice (see Problems 4.1 and 4.2).

Consider, thus, a ratio estimator, represented under the importance
sampling form

$$\delta_h^n = \sum_{i=1}^n \omega_i\, h(x_i) \Big/ \sum_{i=1}^n \omega_i,$$

without loss of generality. We assume that the $x_i$'s are realizations of random
variables $X_i \sim g(y)$, where g is a candidate distribution for target f. The $\omega_i$'s
are realizations of random variables $W_i$ such that $E[W_i | X_i = x] = \kappa f(x)/g(x)$,
$\kappa$ being an arbitrary constant (that corresponds to the lack of normalizing
constants in f and g). We denote

$$S_h^n = \sum_{i=1}^n W_i\, h(X_i), \qquad S_1^n = \sum_{i=1}^n W_i.$$

(Note that we do not assume independence between the $X_i$'s as in regular
importance sampling.) Then, as shown in Liu (1996a) and Gasemyr (2002),
the asymptotic variance of $\delta_h^n$ can be derived in general:

Lemma 4.3. The asymptotic variance of $\delta_h^n$ is

$$\mathrm{var}(\delta_h^n) = \frac{1}{n^2 \kappa^2} \left( \mathrm{var}(S_h^n) - 2 E_f[h]\, \mathrm{cov}(S_h^n, S_1^n) + E_f[h]^2\, \mathrm{var}(S_1^n) \right).$$

Proof. As shown in Casella and Berger (2001), the variance of a ratio X/Y
of random variables can be approximated by the delta method (also called
Cramér-Wold’s theorem) as

(4.5) $\mathrm{var}\left( \frac{X}{Y} \right) \approx \frac{\mathrm{var}(X)}{E[Y]^2} - 2\, \frac{E[X]}{E[Y]^3}\, \mathrm{cov}(X, Y) + \frac{E[X]^2}{E[Y]^4}\, \mathrm{var}(Y).$

The result then follows from straightforward computation (Problem 4.7). □



As in Liu (1996a), we can then deduce that, for the regular importance
sampling estimator and the right degree of approximation, we have

$$\mathrm{var}(\delta_h^n) \approx \frac{1}{n}\, \mathrm{var}_f(h(X))\, \{1 + \mathrm{var}_g(W)\},$$

which evaluates the additional degree of variability due to the denominator
in the importance sampling ratio. (Of course, this is a rather crude
approximation, as can be seen through the fact that this variance is always higher
than $\mathrm{var}_f(h(X))/n$, which is the variance for an iid sample with the same size,
and there exist choices of g and h where this does not hold!)
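
The variance formula of Lemma 4.3 can be evaluated "on the run" from a single sample. The Python sketch below assumes iid draws, so that the variances and covariance of the sums are n times the per-term sample moments, and it plugs the ratio estimate in for $E_f[h]$. The function names and the check on the Cauchy-normal setting of Example 4.2 are illustrative assumptions.

```python
import numpy as np

def ratio_estimate_with_variance(w, hx):
    """Ratio estimate sum(w*h)/sum(w) and its delta-method variance (iid draws assumed)."""
    w = np.asarray(w, dtype=float)
    hx = np.asarray(hx, dtype=float)
    n = len(w)
    kappa_hat = w.mean()                      # estimate of the constant kappa
    delta = np.sum(w * hx) / np.sum(w)
    c = np.cov(w * hx, w)                     # 2x2 sample covariance of (W h(X), W)
    # Under independence, var(S_h) = n*c[0,0], cov(S_h, S_1) = n*c[0,1], var(S_1) = n*c[1,1]
    var = n * (c[0, 0] - 2 * delta * c[0, 1] + delta**2 * c[1, 1]) / (n * kappa_hat) ** 2
    return delta, var

def cauchy_normal_check(x=2.5, n=500, reps=2000, seed=0):
    """Compare the predicted variance with the variance observed over replications."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(reps)
    predicted = np.empty(reps)
    for r in range(reps):
        theta = rng.normal(x, 1.0, size=n)
        w = 1.0 / (1.0 + theta**2)
        estimates[r], predicted[r] = ratio_estimate_with_variance(w, theta)
    return estimates.var(ddof=1), predicted.mean()

print(cauchy_normal_check())
```

The first returned value is the empirical variance over independent replications and the second is the average delta-method prediction, so the two can be compared directly.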

Example 4.4 (Continuation of Example 4.2). If we generate normal
$\mathcal{N}(x, 1)$ samples for the importance sampling approximation^ of $\delta^\pi$ in (4.3),

^ The estimator (4.4) is, formally, an importance sampler because the target
distribution is the Cauchy. However, the calculation is just a straightforward Monte
Carlo sum, generating the variables from $\mathcal{N}(x, 1)$. This illustrates that we can
consider any Monte Carlo sum an importance sampling calculation.
Fig. 4.2. Convergence of 1000 parallel sequences of Monte Carlo estimators for the
posterior mean in the Cauchy-Normal problem when x = 2.5: The straight line is
the running average of 1000 estimates, the dotted line is the empirical 90% band
and the dashed line is the normal approximation 95% confidence interval, using the
variance approximation of Lemma 4.3. The grey shading represents the range of the
entire set of estimates at each iteration.



the variance approximation of Lemma 4.3 can be used to assess the variability
of these estimates, but, again, the asymptotic nature of the approximation
must be taken into account. Figure 4.2 compares the asymptotic variance
(computed over 1,000 parallel sequences of estimators of $\delta^\pi(x)$ for x = 2.5)
with the actual variation of the estimates, evaluated over the 1,000 parallel
sequences. Even though the scales are not very different, there is once more a
larger variability than predicted.

Figure 4.3 reproduces this evaluation in the case where the $\theta_i$'s are
simulated from the prior $\mathcal{C}(0, 1)$ distribution and associated with the importance
sampling estimate^

$$\sum_{i=1}^n \theta_i \exp\{-(x - \theta_i)^2/2\} \Big/ \sum_{i=1}^n \exp\{-(x - \theta_i)^2/2\}.$$

In this case, since the corresponding functions h are bounded for both choices,
the variabilities of the estimates are quite similar, with a slight advantage to
the normal sampling. ||



^ The inversion of the roles of the $\mathcal{N}(x, 1)$ and $\mathcal{C}(0, 1)$ distributions illustrates once
more both the ambiguity of the integral representation and the opportunities
opened by importance sampling.
Fig. 4.3. Same plot as Figure 4.2 when the $\theta_i$'s are simulated from the prior $\mathcal{C}(0, 1)$
distribution.



4.1.2 Multivariate Monitoring 

As mentioned in the introduction to this chapter, one valid method for attach- 
ing variances to a mean plot, and of having a valid Central Limit Theorem, 
is to derive the bounds using a multivariate approach. Although the entire 
calculation is not difficult, and is even enlightening, it does get a bit notation- 
intensive, like many multivariate calculations. 

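Suppose that $X_1, X_2, \ldots$ is a sequence of independent (iid) random
variables that are simulated with the goal of estimating $\mu = E_f(X_1)$. (Without
loss of generality we work with the $X_i$ when we are typically interested in
$h(X_i)$ for some function h.) Define $\bar X_m = (1/m) \sum_{i=1}^m X_i$ for m = 1, 2, . . . , n,
where n is the number of random variables that we will simulate (typically a
large number). A running mean plot, something that we have already seen, is
a plot of $\bar X_m$ against m, and our goal is to put valid error bars on this plot.

For simplicity’s sake, we assume that $X_i \sim \mathcal{N}(\mu, \sigma^2)$, independent, and
want the distribution of the vector $\bar{\mathbf X} = (\bar X_1, \bar X_2, \ldots, \bar X_n)$. Since each element
of this random vector has mean $\mu$, a simultaneous confidence interval, based
on the multivariate normal distribution, will be a valid assessment of variance.

Let $\mathbf 1$ denote the n x 1 vector of ones. Then $E[\bar{\mathbf X}] = \mathbf 1 \mu$. Moreover, it is
straightforward to calculate

$$\mathrm{cov}(\bar X_k, \bar X_{k'}) = \frac{\sigma^2}{\max\{k, k'\}}.$$

It then follows that $\bar{\mathbf X} \sim \mathcal{N}_n(\mathbf 1 \mu, \Sigma)$, where

$$\Sigma = \sigma^2 \begin{pmatrix}
1 & 1/2 & 1/3 & \cdots & 1/n \\
1/2 & 1/2 & 1/3 & \cdots & 1/n \\
1/3 & 1/3 & 1/3 & \cdots & 1/n \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1/n & 1/n & 1/n & \cdots & 1/n
\end{pmatrix}$$

and

(4.6) $(\bar{\mathbf X} - \mathbf 1 \mu)'\, \Sigma^{-1} (\bar{\mathbf X} - \mathbf 1 \mu) \sim \begin{cases} \chi^2_n & \text{if } \sigma^2 \text{ is known} \\ n F_{n,\nu} & \text{if } \sigma^2 \text{ is unknown,} \end{cases}$

when we have an estimate $\hat\sigma^2 \sim \chi^2_\nu$, independent of $\bar{\mathbf X}$.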

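Our goal now is to make a running mean plot of $(\bar X_1, \bar X_2, \ldots, \bar X_n)$ and, at
each value of $\bar X_k$, attach error bars according to (4.6). At first, this seems like
a calculational nightmare: Since n will typically be 5-20 thousand or more,
we are in the position of inverting a huge matrix $\Sigma$ a huge number of times.
This could easily get prohibitive.

However, it turns out that the inverse of $\Sigma$ is not only computable in
closed form, but its elements can be computed with a recursion relation: the
matrix $\Sigma^{-1}$ is tridiagonal (Problem 4.9) and is given by

(4.7) $\Sigma^{-1} = \frac{1}{\sigma^2} \begin{pmatrix}
2 & -2 & 0 & 0 & 0 & \cdots & 0 \\
-2 & 8 & -6 & 0 & 0 & \cdots & 0 \\
0 & -6 & 18 & -12 & 0 & \cdots & 0 \\
0 & 0 & -12 & 32 & -20 & \cdots & 0 \\
\vdots & & & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & & 2(n-1)^2 & -n(n-1) \\
0 & 0 & 0 & \cdots & 0 & -n(n-1) & n^2
\end{pmatrix},$

that is, the diagonal entries are $2k^2$ for $k < n$ and $n^2$ for $k = n$, and the entries next to the diagonal are $-k(k+1)$.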



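Finally, if we choose $d_n$ to be the appropriate $\chi^2_n$ or $n F_{n,\nu}$ cutoff point, the
confidence limits on $\mu$ are given by

$$\{\mu : (\bar{\mathbf x} - \mathbf 1 \mu)'\, \Sigma^{-1} (\bar{\mathbf x} - \mathbf 1 \mu) \le d\},$$

and a bit of algebra will show that this is equivalent to

(4.8) $\{\mu : n\mu^2 - 2n \bar x_n \mu + \bar{\mathbf x}' \Sigma^{-1} \bar{\mathbf x} \le d\} = \left\{ \mu : \mu \in \bar x_n \pm \sqrt{\bar x_n^2 - \frac{\bar{\mathbf x}' \Sigma^{-1} \bar{\mathbf x} - d}{n}} \right\}.$

To implement this procedure, we can plot the estimate of $\mu$,

(4.9) $\bar x_k \pm \sqrt{\bar x_k^2 - \frac{\bar{\mathbf x}_k' \Sigma_k^{-1} \bar{\mathbf x}_k - d_k}{k}}$ for k = 1, 2, . . . , n,

where $\bar{\mathbf x}_k$, $\Sigma_k$, and $d_k$ denote the vector of the first k running means, its covariance matrix, and the corresponding cutoff point.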



Figure 4.4 shows this plot, along with the univariate normal bands of Section
4.1.1. The difference is striking, and shows that the univariate bands are
extremely overoptimistic. The more conservative multivariate bands give a
much more accurate picture of the confidence in the convergence.

Note that if we implement (4.8) in a plot such as Figure 4.4, the width
of the band does not depend on k = 1, 2, . . . , n, so it will produce horizontal
lines with width equal to the final value. Instead, we suggest using (4.9), which
gives the appropriate band for each value of k, showing the evolution of the
band and equaling (4.8) at k = n.

This procedure, although dependent on the normality assumption, can
be used as an approximation even when normality does not hold, with the
gain from the more conservative procedure outweighing any loss due to the
violation of normality.
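
The closed form of the inverse covariance matrix and the resulting bands can be checked numerically. The Python sketch below is an illustration under stated assumptions: $\sigma^2$ is treated as known, the chi squared cutoff is approximated by the Wilson-Hilferty formula (a substitution made here to avoid extra dependencies), and the quadratic form is computed through the identity $\bar{\mathbf x}' \Sigma^{-1} \bar{\mathbf x} = \sum_i x_i^2 / \sigma^2$, which follows from the tridiagonal structure. All function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def sigma_matrix(n, sigma2=1.0):
    """Covariance of the running means: sigma2 / max(k, k')."""
    k = np.arange(1, n + 1)
    return sigma2 / np.maximum.outer(k, k)

def sigma_inverse(n, sigma2=1.0):
    """Tridiagonal inverse: diagonal 2k^2 (n^2 in the last row), off-diagonal -k(k+1)."""
    k = np.arange(1, n + 1)
    diag = 2.0 * k**2
    diag[-1] = n**2
    off = -(k[:-1] * (k[:-1] + 1.0))
    return (np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)) / sigma2

def chi2_quantile(df, level=0.95):
    """Wilson-Hilferty approximation to the chi squared quantile (an approximation)."""
    z = NormalDist().inv_cdf(level)
    return df * (1 - 2 / (9 * df) + z * np.sqrt(2 / (9 * df))) ** 3

def multivariate_bands(x, sigma2, level=0.95):
    """Running means and half-widths of the simultaneous bands, sigma2 known."""
    x = np.asarray(x, dtype=float)
    k = np.arange(1, len(x) + 1)
    means = np.cumsum(x) / k
    quad = np.cumsum(x**2) / sigma2           # quadratic form for the first k means
    d = chi2_quantile(k, level)
    radicand = means**2 - sigma2 * (quad - d) / k
    half = np.sqrt(np.clip(radicand, 0.0, None))   # empty confidence set mapped to width 0
    return means, half

rng = np.random.default_rng(0)
means, half = multivariate_bands(rng.normal(1.0, 1.0, size=1000), sigma2=1.0)
print(means[-1], half[-1])
```

Because only cumulative sums are needed, the bands are obtained in linear time without forming or inverting any matrix.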




Fig. 4.4. Monte Carlo estimator of the integral of $h(x) = [\cos(50x) + \sin(20x)]^2$
(solid line). The narrow bands (grey shading) are the univariate normal approximate
95% confidence interval, and the wider bands (lighter grey) are the multivariate
bands of (4.9).



4.2 Rao-Blackwellization 

An approach to reduce the variance of an estimator is to use the conditioning
inequality

$$\mathrm{var}(E[\delta(X)|Y]) \le \mathrm{var}(\delta(X)),$$

sometimes called Rao-Blackwellization (Gelfand and Smith 1990; Liu et al.
1994; Casella and Robert 1996) because the inequality is associated with the
Rao-Blackwell Theorem (Lehmann and Casella 1998), although the
conditioning is not always in terms of sufficient statistics.
In a simulation context, if $\delta(X)$ is an estimator of $\mathfrak{I} = E_f[h(X)]$ and if X
can be simulated from the joint distribution $f(x, y)$ satisfying

$$\int f(x, y)\, dy = f(x),$$

the estimator $\delta^*(Y) = E_f[\delta(X)|Y]$ dominates $\delta$ in terms of variance (and
in squared error loss, since the bias is the same). Obviously, this result only
applies in settings where $\delta^*(Y)$ can be explicitly computed.

Example 4.5. Student’s t expectation. Consider the expectation of
$h(x) = \exp(-x^2)$ when $X \sim \mathcal{T}(\nu, \mu, \sigma^2)$. The Student’s t distribution can be
simulated as a mixture of a normal distribution and of a gamma distribution
by Dickey’s decomposition (1968),

$$X | y \sim \mathcal{N}(\mu, \sigma^2 y) \quad \text{and} \quad Y^{-1} \sim \mathcal{G}a(\nu/2, \nu/2).$$

Therefore, the empirical average

$$\delta_m = \frac{1}{m} \sum_{j=1}^m \exp(-X_j^2)$$

can be improved upon when the $X_j$ are parts of the sample $((X_1, Y_1), \ldots, (X_m, Y_m))$,
since

(4.10) $\delta_m^* = \frac{1}{m} \sum_{j=1}^m E[\exp(-X^2) | Y_j] = \frac{1}{m} \sum_{j=1}^m \frac{1}{\sqrt{2\sigma^2 Y_j + 1}}$

is the conditional expectation when $\mu = 0$ (see Problem 4.4). Figure 4.5
provides an illustration of the difference of the convergences of $\delta_m$ and $\delta_m^*$ to
$E[\exp(-X^2)]$ for $(\nu, \mu, \sigma) = (4.6, 0, 1)$. For $\delta_m$ to have the same precision as
$\delta_m^*$ requires 10 times as many simulations. ||
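
A short Python sketch of the two estimators of Example 4.5 follows, assuming $\mu = 0$ and using the normal-gamma mixture representation of the Student's t distribution. The function name and sample size are illustrative.

```python
import numpy as np

def student_t_estimators(nu, sigma, m, rng):
    """Per-draw terms of the empirical average and of its Rao-Blackwellized version."""
    g = rng.gamma(shape=nu / 2, scale=2 / nu, size=m)    # Y^{-1} ~ Ga(nu/2, rate nu/2)
    y = 1.0 / g
    x = sigma * np.sqrt(y) * rng.standard_normal(m)      # X | y ~ N(0, sigma^2 y)
    plain = np.exp(-x**2)                                # terms of the empirical average
    rb = 1.0 / np.sqrt(2 * sigma**2 * y + 1)             # E[exp(-X^2) | y], as in (4.10)
    return plain, rb

rng = np.random.default_rng(0)
plain, rb = student_t_estimators(nu=4.6, sigma=1.0, m=100_000, rng=rng)
print(plain.mean(), rb.mean(), plain.var() / rb.var())
```

The last printed ratio compares the per-draw variances, which is one way to quantify how many extra simulations the unconditioned average needs for the same precision.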



Unfortunately, this conditioning method seems to enjoy a limited appli- 
cability since it involves a particular type of simulation (joint variables) and 
requires functions that are sufficiently regular for the conditional expectations 
to be explicit. 

There exists, however, a specific situation where Rao-Blackwellization is 
always possible.^ This is in the general setup of Accept-Reject methods, 
which are not always amenable to the other acceleration techniques mentioned 
later in Section 4.4.1 and Section 4.4.2. (Casella and Robert 1996 distinguish 
between parametric Rao-Blackwellization and nonparametric Rao-Blackwell- 
ization, the parametric version being more restrictive and only used in specific 
setups such as Gibbs sampling. See Section 7.6.2.) 

^ This part contains rather specialized material, which will not be used again in 
the book. It can be omitted at first reading. 
Fig. 4.5. Convergence of the estimators of $E[\exp(-X^2)]$, $\delta_m$ (solid lines) and $\delta_m^*$
(dots) for $(\nu, \mu, \sigma) = (4.6, 0, 1)$. The final values are 0.5405 and 0.5369, respectively,
for a true value equal to 0.5373.



Consider an Accept-Reject method based on the instrumental
distribution g. If the original sample produced by the algorithm is $(X_1, \ldots, X_m)$, it
can be associated with two iid samples, $(U_1, \ldots, U_N)$ and $(Y_1, \ldots, Y_N)$, with
corresponding distributions $\mathcal{U}_{[0,1]}$ and g; N is then the stopping time
associated with the acceptance of m variables $Y_j$. An estimator of $E_f[h]$ based on
$(X_1, \ldots, X_m)$ can therefore be written

$$\delta_1 = \frac{1}{m} \sum_{i=1}^m h(X_i) = \frac{1}{m} \sum_{j=1}^N h(Y_j)\, \mathbb{I}_{U_j \le w_j},$$

with $w_j = f(Y_j)/M g(Y_j)$. A reduction of the variance of $\delta_1$ can be obtained
by integrating out the $U_i$'s, which leads to the estimator

$$\delta_2 = \frac{1}{m} \sum_{j=1}^N E[\mathbb{I}_{U_j \le w_j} | N, Y_1, \ldots, Y_N]\, h(Y_j) = \frac{1}{m} \sum_{i=1}^N \rho_i\, h(Y_i),$$

where, for i = 1, . . . , n - 1, $\rho_i$ satisfies

(4.11) $\rho_i = P(U_i \le w_i | N = n, Y_1, \ldots, Y_n) = w_i\, \frac{\sum_{(i_1, \ldots, i_{m-2})} \prod_{j=1}^{m-2} w_{i_j} \prod_{j=m-1}^{n-2} (1 - w_{i_j})}{\sum_{(i_1, \ldots, i_{m-1})} \prod_{j=1}^{m-1} w_{i_j} \prod_{j=m}^{n-1} (1 - w_{i_j})},$

while $\rho_n = 1$. The numerator sum is over all subsets of $\{1, \ldots, i-1, i+1, \ldots, n-1\}$
of size m - 2, and the denominator sum is over all subsets
of size m - 1. The resulting estimator $\delta_2$ is an average over all the
possible permutations of the realized sample, the permutations being weighted by
their probabilities. The Rao-Blackwellized estimator is then a function only
of $(N, Y_{(1)}, \ldots, Y_{(N-1)}, Y_N)$, where $Y_{(1)}, \ldots, Y_{(N-1)}$ are the order statistics.

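Although the computation of the $\rho_i$'s may appear formidable, a recurrence
relation of order $n^2$ can be used to calculate the estimator. Define, for $k \le m < n$,

$$S_k(m) = \sum_{(i_1, \ldots, i_k)} \prod_{j=1}^{k} w_{i_j} \prod_{j=k+1}^{m} (1 - w_{i_j}),$$

with $\{i_1, \ldots, i_m\} = \{1, \ldots, m\}$, $S_k(m) = 0$ for $k > m$, and $S_k^i(i) = S_k(i - 1)$,
where $S_k^i(m)$ denotes the same sum computed without the index i.
Then we can recursively calculate

(4.12) $S_k(m) = w_m S_{k-1}(m - 1) + (1 - w_m) S_k(m - 1),$

$S_k^i(m) = w_m S_{k-1}^i(m - 1) + (1 - w_m) S_k^i(m - 1)$

and note that weight $\rho_i$ of (4.11) is given by

$\rho_i = w_i\, S_{t-2}^i(n - 1) / S_{t-1}(n - 1)$ $(i < n)$,

where t denotes the number of accepted values (written m in (4.11)).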
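
The recursion can be sketched in Python as follows. This is an illustrative version, not the optimized order $n^2$ scheme: it recomputes the leave-one-out sums for each index, and it includes a brute-force enumeration over subsets as a check for small n. The function names are hypothetical, `w` holds the acceptance probabilities $w_j$ with the last entry corresponding to the final (accepted) value, and `t` is the total number of accepted values.

```python
from itertools import combinations
import numpy as np

def count_probs(w):
    """S_k(len(w)) for k = 0..len(w): weight of exactly k acceptances among len(w) trials."""
    S = np.zeros(len(w) + 1)
    S[0] = 1.0
    for wm in w:
        # recursion S_k(m) = w_m S_{k-1}(m-1) + (1 - w_m) S_k(m-1)
        S[1:] = wm * S[:-1] + (1 - wm) * S[1:]
        S[0] *= 1 - wm
    return S

def rb_weights(w, t):
    """Weights rho_i given t acceptances in total, the last value being accepted."""
    w = np.asarray(w, dtype=float)
    head = w[:-1]
    rho = np.ones(len(w))                  # the last weight equals 1
    if t == 1:
        rho[:-1] = 0.0                     # no acceptance before the last value
        return rho
    denom = count_probs(head)[t - 1]
    for i in range(len(head)):
        others = np.delete(head, i)        # leave index i out
        rho[i] = head[i] * count_probs(others)[t - 2] / denom
    return rho

def brute_force_weights(w, t):
    """Same weights by enumerating all acceptance patterns (small n only)."""
    w = np.asarray(w, dtype=float)
    head = w[:-1]
    num = np.zeros(len(head))
    den = 0.0
    for accepted in combinations(range(len(head)), t - 1):
        acc = set(accepted)
        prob = np.prod([head[j] if j in acc else 1 - head[j] for j in range(len(head))])
        den += prob
        for j in accepted:
            num[j] += prob
    return np.append(num / den, 1.0)

w = [0.3, 0.7, 0.5, 0.2, 0.9]
print(rb_weights(w, t=3))
```

Given the weights, the Rao-Blackwellized estimate is `np.sum(rho * h(Y)) / t` for the proposed values `Y`.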

Note that, if the random nature of N and its dependence on the sample are
ignored when taking the conditional expectation, this leads to the importance
sampling estimator,

$$\delta_3 = \sum_{j=1}^N w_j\, h(Y_j) \Big/ \sum_{j=1}^N w_j,$$

which does not necessarily improve upon $\delta_1$ (see Section 3.3.3).

Casella and Robert (1996) establish the following proposition, showing
that $\delta_2$ can be computed and dominates $\delta_1$. (The proof is left to Problem
4.6.)

Proposition 4.6. The estimator $\delta_2 = \frac{1}{m} \sum_{j=1}^N \rho_j\, h(Y_j)$ dominates the
estimator $\delta_1$ under quadratic loss.

The computation of the weights $\rho_i$ is obviously more costly than the
derivation of the weights of the importance sampling estimator $\delta_3$ or of the corrected
estimator of Section 3.3.3. However, the recursive formula (4.12) leads to an
overall simplification of the computation of the coefficients $\rho_i$. Nonetheless,
the increase in computing time can go as high as seven times (Casella and
Robert 1996), but the corresponding variance decrease is even greater (80%).

Example 4.7. (Continuation of Example 3.15) If we repeat the
simulation of $\mathcal{G}a(\alpha, \beta)$ from the $\mathcal{G}a(a, b)$ distribution, with $a \in \mathbb{N}$, it is of
interest to compare the estimator obtained by Rao-Blackwellization, $\delta_2$, with
the standard estimator based on the Accept-Reject sample and with the
biased importance sampling estimator $\delta_3$. Figure 4.6 illustrates the substantial
improvement brought by conditioning, since $\delta_2$ (uniformly) dominates both
Fig. 4.6. Comparisons of the errors $\delta - E[h_i(X)]$ of the Accept-Reject estimator $\delta_1$
(long dashes), of the importance sampling estimator $\delta_3$ (dots), and of the
conditional version of $\delta_1$, $\delta_2$ (solid lines), for $h_1$ (top), $h_2(x) = x \log(x)$ (middle),
and $h_3(x) = x/(1 + x)$ (bottom) and $\alpha = 3.7$, $\beta = 1$. The final errors are
respectively -0.998, -0.982, and -0.077 (top), -0.053, -0.053, and -0.001 (middle), and
-0.0075, -0.0074, and -0.00003 (bottom).



alternatives for the observed simulations. Note also the strong similarity
between $\delta_1$ and the importance sampling estimator, the latter failing to bring
any noticeable improvement in this setup. ||

For further discussion, estimators, and examples, see Casella (1996) and Casella
and Robert (1996). See also Perron (1999), who does a slightly different
calculation, conditioning on N and the order statistics $Y_{(1)}, \ldots, Y_{(n-1)}$ in (4.11).

4.3 Riemann Approximations 

In approximating an integral $\mathfrak{I}$ like (4.1), the simulation-based approach is
justified by a probabilistic convergence result for the empirical average

$$\frac{1}{m} \sum_{i=1}^m h(X_i),$$

when the $X_i$'s are simulated according to f. As briefly mentioned in Section
1.4, numerical integration (for one-dimensional integrals) is based on the
analytical definition of the integral, namely as a limit of Riemann sums. In fact,
for every sequence $(a_{i,n})_i$ ($0 \le i \le n$) such that $a_{0,n} = a$, $a_{n,n} = b$, and
$a_{i+1,n} - a_{i,n}$ converges to 0 (in n), the (Riemann) sum

$$\sum_{i=0}^{n-1} h(a_{i,n})\, f(a_{i,n})\, (a_{i+1,n} - a_{i,n})$$

converges to (4.1) as n goes to infinity. When $\mathcal{X}$ has dimension greater than
1, the same approximation applies with a grid on the domain $\mathcal{X}$ (see Rudin
1976 for more details).

When the two approaches are put together, the result is a Riemann sum
with random steps, with the $a_{i,n}$'s simulated from f (or from an instrumental
distribution g). This method was first introduced by Yakowitz et al. (1978)
as weighted Monte Carlo integration for uniform distributions on $[0, 1]^d$. In
a more general setup, we call this approach simulation by Riemann sums or
Riemannian simulation, following Philippe (1997a,b), although it is truly an
integration method.

Definition 4.8. The method of simulation by Riemann sums approximates
the integral $\mathfrak{I}$ by

(4.13) $\sum_{i=0}^{m-1} h(X_{(i)})\, f(X_{(i)})\, (X_{(i+1)} - X_{(i)}),$

where $X_0, \ldots, X_m$ are iid random variables from f and $X_{(0)} \le \cdots \le X_{(m)}$ are
the order statistics associated with this sample.

Suppose first that the integral $\mathfrak{I}$ can be written

    $\mathfrak{I} = \int_0^1 h(x)\, dx$ ,

and that $h$ is a differentiable function. We can then establish the following result about the validity of the Riemannian approximation. (See Problem 4.14 for the proof.)

Proposition 4.9. Let $U = (U_0, U_1, \ldots, U_m)$ be an ordered sample from $\mathcal{U}_{[0,1]}$. If the derivative $h'$ is bounded on $[0,1]$, the estimator

    $\delta(U) = \sum_{i=0}^{m-1} h(U_i)(U_{i+1} - U_i) + h(0)\,U_0 + h(U_m)(1 - U_m)$

has a variance of order $O(m^{-2})$.

Yakowitz et al. (1978) improve on the order of the variance by symmetrizing $\delta$ into

    $\tilde\delta = \sum_{i=-1}^{m} (U_{i+1} - U_i)\, \frac{h(U_i) + h(U_{i+1})}{2}$ ,

with the conventions $U_{-1} = 0$ and $U_{m+1} = 1$. When the second derivative of $h$ is bounded, the error of $\tilde\delta$ is then of order $O(m^{-4})$.

Even when the additional assumption on the second derivative is not satisfied, the practical improvement brought by Riemann sums (when compared with regular Monte Carlo integration) is substantial since the magnitude of the variance decreases from $m^{-1}$ to $m^{-2}$. Unfortunately, this dominance fails to extend to the case of multidimensional integrals, a phenomenon that is related to the so-called "curse of dimensionality"; that is, the subefficiency of numerical methods compared with simulation algorithms for dimensions $d$ larger than 4, since the error is then of order $O(m^{-4/d})$ (see Yakowitz et al. 1978). The intuitive reason behind this phenomenon is that a numerical approach like the Riemann sum method basically covers the entire space with a grid. When the dimension of the space increases, the number of points on the grid necessary to obtain a given precision increases too, which means, in practice, a much larger number of iterations for the same precision.
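The variance claim of Proposition 4.9 can be checked empirically. The following is a minimal sketch, not taken from the text: the function names, the choice $h = \exp$, and the replication counts are all illustrative assumptions. It computes the estimator $\delta(U)$ of the proposition and compares its empirical variance with that of the plain average over repeated samples.

```python
import numpy as np


def riemann_uniform(h, u):
    """Riemann-sum estimator of Proposition 4.9 for a sample u from U[0, 1]."""
    u = np.sort(u)
    inner = np.sum(h(u[:-1]) * np.diff(u))             # sum of h(U_i) (U_{i+1} - U_i)
    edges = h(0.0) * u[0] + h(u[-1]) * (1.0 - u[-1])   # h(0) U_0 + h(U_m) (1 - U_m)
    return inner + edges


def compare_variances(h, m, reps=300, seed=0):
    """Empirical variances of the Riemann estimator and of the plain average."""
    rng = np.random.default_rng(seed)
    riemann = np.empty(reps)
    average = np.empty(reps)
    for r in range(reps):
        u = rng.uniform(size=m + 1)        # U_0, ..., U_m
        riemann[r] = riemann_uniform(h, u)
        average[r] = h(u).mean()
    return riemann.var(), average.var()


if __name__ == "__main__":
    # Illustrative choice (an assumption): h = exp, whose integral on [0, 1] is e - 1.
    for m in (100, 400, 1600):
        print(m, *compare_variances(np.exp, m))
```

Printing the two variances for increasing $m$ gives a rough picture of how much faster the Riemann estimator stabilizes in one dimension; the sketch says nothing about the multidimensional case discussed above.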

The result of Proposition 4.9 holds for an arbitrary density, due to the property that the integral $\mathfrak{I}$ can also be written as

(4.14)    $\int_0^1 H(x)\, dx$ ,

where $H(x) = h(F^-(x))$, and $F^-$ is the generalized inverse of $F$, cdf of $f$ (see Lemma 2.4). Although this is a formal representation when $F^-$ is not available in closed form and cannot be used for simulation purposes in most cases (see Section 2.1.2), (4.14) is central to this extension of Proposition 4.9. Indeed, since

    $X_{(i+1)} - X_{(i)} = F^-(U_{i+1}) - F^-(U_i)$ ,

where the $X_{(i)}$'s are the order statistics of a sample from $F$ and the $U_i$'s are the order statistics of a sample from $\mathcal{U}_{[0,1]}$,

    $\sum_{i=0}^{m-1} h(X_{(i)})\, f(X_{(i)})\, (X_{(i+1)} - X_{(i)})$
        $= \sum_{i=0}^{m-1} H(U_i)\, f(F^-(U_i))\, (F^-(U_{i+1}) - F^-(U_i))$
        $\approx \sum_{i=0}^{m-1} H(U_i)(U_{i+1} - U_i)$ ,

given that $(F^-)'(u) = 1/f(F^-(u))$. Since the remainder is negligible in the first-order expansion of $F^-(U_{i+1}) - F^-(U_i)$, the above Riemann sum can then be expressed in terms of uniform variables and Proposition 4.9 does apply in this setup, since the extreme terms $h(0)U_0$ and $h(U_m)(1 - U_m)$ are again of order $m^{-2}$ (variancewise). (See Philippe 1997a,b, for more details on the convergence of (4.13) to $\mathfrak{I}$ under some conditions on the function $h$.)

The above results imply that the Riemann sums integration method will perform well in unidimensional setups when the density $f$ is known. It thus provides an efficient alternative to standard Monte Carlo integration in this setting, since it does not require additional computations (although it requires keeping track of and storing all the $X_{(i)}$'s). Also, as the convergence is of a higher order, there is no difficulty in implementing the method. When $f$ is known only up to a constant (that is, $f_0(x) \propto f(x)$), (4.13) can be replaced by

(4.15)    $\frac{\sum_{i=0}^{m-1} h(X_{(i)})\, f_0(X_{(i)})\, (X_{(i+1)} - X_{(i)})}{\sum_{i=0}^{m-1} f_0(X_{(i)})\, (X_{(i+1)} - X_{(i)})}$ ,

since both the numerator and the denominator almost surely converge. This approach thus provides, in addition, an efficient estimation method for the normalizing constant via the denominator in (4.15). Note also that when $f$ is entirely known, the denominator converges to 1, which can be used as a convergence assessment device (see Philippe and Robert 2001, and Section 12.2.4).
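As an illustration of the ratio estimator (4.15), here is a minimal sketch under assumptions that are ours rather than the text's: the sample is standard normal, the unnormalized density is $f_0(x) = e^{-x^2/2}$, and $h(x) = x^2$. The helper name `normalized_riemann` is hypothetical.

```python
import numpy as np


def normalized_riemann(h, f0, x):
    """Ratio estimator (4.15) built from an unnormalized density f0."""
    x = np.sort(x)
    dx = np.diff(x)                    # X_(i+1) - X_(i)
    w = f0(x[:-1]) * dx                # f0(X_(i)) (X_(i+1) - X_(i))
    denominator = np.sum(w)            # estimates the normalizing constant of f0
    numerator = np.sum(h(x[:-1]) * w)
    return numerator / denominator, denominator


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.standard_normal(2000)
    estimate, constant = normalized_riemann(
        lambda x: x**2,                    # h(x) = x^2, so the target is E[X^2] = 1
        lambda x: np.exp(-0.5 * x**2),     # unnormalized N(0, 1) density
        sample,
    )
    print(estimate, constant, np.sqrt(2 * np.pi))
```

The second returned value is the denominator of (4.15); in this assumed setting it should approach $\sqrt{2\pi}$, which is the kind of by-product estimate of the normalizing constant mentioned above.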

Example 4.10. (Continuation of Example 3.15) When $X \sim \mathcal{G}a(3.7, 1)$, assume that $h_2(x) = x \log(x)$ is the function of interest. A sample $X_1, \ldots, X_m$ from $\mathcal{G}a(3.7, 1)$ can easily be produced by the algorithms [A.14] or [A.15] of Chapter 2 and we compare the empirical mean, $\delta_{1m}$, with the Riemann sum estimator

    $\delta_{2m} = \frac{1}{\Gamma(3.7)} \sum_{i=1}^{m-1} h_2(X_{(i)})\, X_{(i)}^{2.7}\, e^{-X_{(i)}}\, (X_{(i+1)} - X_{(i)})$ ,

which uses the known normalizing constant. Figure 4.7 clearly illustrates the difference in convergence speed between the two estimators and the much greater stability of $\delta_{2m}$, which is close to the theoretical value 5.3185 after 3000 iterations. ||
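The comparison of Example 4.10 can be reproduced along the following lines. This is a sketch only: it draws the gamma sample with NumPy's generator rather than with algorithms [A.14] or [A.15], and the function name `gamma_riemann` and the seed are assumptions.

```python
import math

import numpy as np


def gamma_riemann(m, alpha=3.7, seed=0):
    """Empirical mean and Riemann-sum estimate of E[X log X] for X ~ Ga(alpha, 1)."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.gamma(shape=alpha, scale=1.0, size=m))
    h = x * np.log(x)                                          # h_2(x) = x log(x)
    # Ga(alpha, 1) density, including the known constant 1 / Gamma(alpha).
    f = np.exp((alpha - 1.0) * np.log(x) - x - math.lgamma(alpha))
    delta1 = h.mean()                                          # empirical mean
    delta2 = np.sum(h[:-1] * f[:-1] * np.diff(x))              # Riemann sum (4.13)
    return delta1, delta2


if __name__ == "__main__":
    for m in (500, 3000, 20000):
        print(m, gamma_riemann(m))
```

Both values can be compared with the theoretical value 5.3185 quoted in the example; how close each one gets for a given $m$ depends on the sample drawn.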

If the original simulation is done by importance sampling (that is, if the sample $X_1, \ldots, X_m$ is generated from an instrumental distribution $g$), since the integral $\mathfrak{I}$ can also be written

    $\mathfrak{I} = \int h(x)\, \frac{f(x)}{g(x)}\, g(x)\, dx$ ,

the Riemann sum estimator (4.13) remains unchanged. Although it has similar convergence properties, the boundedness conditions on $h$ are less explicit and, thus, more difficult to check. As in the original case, it is possible to establish an equivalent to Theorem 3.12, namely to show that $g(x) \propto |h(x)| f(x)$ is optimal (in terms of variance) (see Philippe 1997c), with the additional advantage that the normalizing constant does not need to be known, since $g$ does not appear in (4.13).

Fig. 4.7. Convergence of estimators of $\mathbb{E}[X \log(X)]$, the Riemann sum $\delta_{2m}$ (solid lines and smooth curve) and the empirical average $\delta_{1m}$ (dots and wiggly curve) for $\alpha = 3.7$ and $\beta = 1$. The final values are 5.3007 and 5.3057, respectively, for a true value of 5.31847.



Example 4.11. (Continuation of Example 3.13) If $\mathcal{T}(\nu, 0, 1)$ is simulated by importance sampling from the normal instrumental distribution $\mathcal{N}(0, \nu/(\nu - 2))$, the difference between the two distributions is mainly visible in the tails. This makes the importance sampling estimator $\delta_{1m}$ very unstable (see Figures 3.5, 3.7, and 3.8). Figure 4.8 compares this estimator to the Riemann sum estimator

    $\delta_{2m} = \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\, \Gamma(\nu/2)} \sum_{i=0}^{m-1} h_4(X_{(i)}) \left[1 + X_{(i)}^2/\nu\right]^{-(\nu+1)/2} (X_{(i+1)} - X_{(i)})$

and its normalized version

    $\delta_{3m} = \frac{\sum_{i=0}^{m-1} h_4(X_{(i)}) \left[1 + X_{(i)}^2/\nu\right]^{-(\nu+1)/2} (X_{(i+1)} - X_{(i)})}{\sum_{i=0}^{m-1} \left[1 + X_{(i)}^2/\nu\right]^{-(\nu+1)/2} (X_{(i+1)} - X_{(i)})}$

for $h_4(X) = (1 + e^X)\, \mathbb{I}_{X \le 0}$ and $\nu = 2.3$.

We can again note the stability of the approximations by Riemann sums, the difference between $\delta_{2m}$ and $\delta_{3m}$ being mainly due to the bias introduced by the approximation of the normalizing constant in $\delta_{3m}$. For the given sample, note that $\delta_{2m}$ dominates the other estimators.

If, instead, the instrumental distribution is chosen to be the Cauchy distribution $\mathcal{C}(0, 1)$, the importance sampling estimator is much better behaved. Figure 4.9 shows that the speed of convergence of the associated estimator is much faster than with the normal instrumental distribution. Although $\delta_{2m}$ and $\delta_{3m}$ have the same representation for $\mathcal{N}(0, \nu/(\nu - 2))$ and $\mathcal{C}(0, 1)$, the corresponding samples differ, and these estimators exhibit an even higher stability in this case, giving a good approximation of $\mathbb{E}[h_4(X)]$ after only a few hundred iterations. The two estimators are actually identical almost from the start, a fact which indicates how fast the denominator of $\delta_{3m}$ converges to the normalizing constant. ||

Fig. 4.8. Convergence of estimators of $\mathbb{E}_\nu[(1 + e^X)\, \mathbb{I}_{X \le 0}]$, $\delta_{1m}$ (solid lines), $\delta_{2m}$ (dots), and $\delta_{3m}$ (dashes), for a normal instrumental distribution and $\nu = 2.3$. The final values are respectively 0.7262, 0.7287, and 0.7329, for a true value of 0.7307.

Fig. 4.9. Convergence of estimators of $\mathbb{E}_\nu[(1 + e^X)\, \mathbb{I}_{X \le 0}]$, $\delta_{1m}$ (solid lines), $\delta_{2m}$ (dots), and $\delta_{3m}$ (dashes) for a Cauchy instrumental distribution and $\nu = 2.3$. The two Riemann sum approximations are virtually equal except for the beginning simulations. The final values are respectively 0.7325, 0.7314, and 0.7314, and the true value is 0.7307.
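The three estimators of Example 4.11 can be sketched as follows for the Cauchy instrumental distribution. This is an illustrative reconstruction under stated assumptions: the function names are hypothetical, the sample size and seed are arbitrary, and the exponential is guarded with `np.minimum` only to avoid overflow on large positive Cauchy draws (where the indicator is zero anyway).

```python
import math

import numpy as np

NU = 2.3  # degrees of freedom used in the example


def t_kernel(x, nu=NU):
    """Unnormalized Student t density [1 + x^2 / nu]^{-(nu + 1) / 2}."""
    return (1.0 + x**2 / nu) ** (-(nu + 1.0) / 2.0)


def t_const(nu=NU):
    """Normalizing constant Gamma((nu+1)/2) / (sqrt(nu pi) Gamma(nu/2))."""
    return math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) / math.sqrt(nu * math.pi)


def h4(x):
    """h_4(x) = (1 + e^x) I(x <= 0); the minimum avoids overflow when x > 0."""
    return (1.0 + np.exp(np.minimum(x, 0.0))) * (x <= 0)


def estimators(m, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_cauchy(size=m)                  # instrumental sample from C(0, 1)
    g = 1.0 / (math.pi * (1.0 + x**2))               # Cauchy density
    delta1 = np.mean(h4(x) * t_const() * t_kernel(x) / g)   # importance sampling
    xs = np.sort(x)
    dx = np.diff(xs)
    k = t_kernel(xs[:-1])
    num = np.sum(h4(xs[:-1]) * k * dx)
    den = np.sum(k * dx)
    delta2 = t_const() * num                         # Riemann sum, known constant
    delta3 = num / den                               # normalized Riemann sum
    return delta1, delta2, delta3


if __name__ == "__main__":
    print(estimators(5000))
```

Note that the instrumental density $g$ enters only the importance sampling estimator; the two Riemann sums use the sorted sample and the target density alone, as stated in the text.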



4.4 Acceleration Methods 

While the different methods proposed in Chapter 3 and in this chapter seem to 
require a comparison, we do not expect there to be any clear-cut domination 
(as was the case with the comparison between Accept-Reject and importance 
sampling in Section 3.3.3). Instead, we look at more global acceleration strate- 
gies, which are more or less independent of the simulation setup but try to 
exploit the output of the simulation in more efficient ways. 

The acceleration methods described below can be used not only in a single implementation but also as a control device to assess the convergence of a simulation algorithm, following the argument of parallel estimators. For example, if $\delta_{1m}, \ldots, \delta_{pm}$ are $p$ convergent estimators of the same quantity $\mathfrak{I}$, a stopping rule for convergence is that $\delta_{1m}, \ldots, \delta_{pm}$ are identical or, given a minimum precision requirement $\varepsilon$, that

    $\max_{1 \le i < j \le p} |\delta_{im} - \delta_{jm}| < \varepsilon$ ,

as in Section 4.1.
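The stopping rule amounts to checking the spread of the current estimates, since the largest pairwise gap equals the maximum minus the minimum. A minimal sketch (the function name and the tolerance are assumptions; the numerical values are the final values reported in Figures 4.9 and 4.8):

```python
import numpy as np


def estimators_agree(estimates, eps):
    """True when max over pairs |delta_i - delta_j| is below the precision eps."""
    est = np.asarray(estimates, dtype=float)
    return bool(est.max() - est.min() < eps)


if __name__ == "__main__":
    print(estimators_agree([0.7325, 0.7314, 0.7314], eps=0.002))  # Cauchy case
    print(estimators_agree([0.7262, 0.7287, 0.7329], eps=0.002))  # normal case
```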

4.4.1 Antithetic Variables 

Although the usual simulation methods lead to iid samples (or quasi-iid, see 
Section 2.6.2), it may actually be preferable to generate samples of correlated 
variables when estimating an integral 3, as they may reduce the variance of 
the corresponding estimator. 

A first setting where the generation of independent samples is less desirable corresponds to the comparison of two quantities which are close in value. If

(4.16)    $\mathfrak{I}_1 = \int g_1(x) f_1(x)\, dx$  and  $\mathfrak{I}_2 = \int g_2(x) f_2(x)\, dx$

are two such quantities, where $\delta_1$ estimates $\mathfrak{I}_1$ and $\delta_2$ estimates $\mathfrak{I}_2$, independently of $\delta_1$, the variance of $(\delta_1 - \delta_2)$ is then $\mathrm{var}(\delta_1) + \mathrm{var}(\delta_2)$, which may be too large to support a fine enough analysis of the difference $\mathfrak{I}_1 - \mathfrak{I}_2$. However, if $\delta_1$ and $\delta_2$ are positively correlated, the variance is reduced by the amount $2\, \mathrm{cov}(\delta_1, \delta_2)$, which may greatly improve the analysis of the difference.

A convincing illustration of the improvement brought by correlated samples is the comparison of (regular) statistical estimators via simulation. Given a density $f(x|\theta)$ and a loss function $L(\delta, \theta)$, two estimators $\delta_1$ and $\delta_2$ are evaluated through their risk functions, $R(\delta_1, \theta) = \mathbb{E}[L(\delta_1, \theta)]$ and $R(\delta_2, \theta)$. In general, these risk functions are not available analytically, but they may be approximated, for instance, by a regular Monte Carlo method,

    $\hat R(\delta_1, \theta) = \frac{1}{m} \sum_{i=1}^{m} L(\delta_1(X_i), \theta)$ ,   $\hat R(\delta_2, \theta) = \frac{1}{m} \sum_{i=1}^{m} L(\delta_2(Y_i), \theta)$ ,

the $X_i$'s and $Y_i$'s being simulated from $f(\cdot|\theta)$. Positive correlation between $L(\delta_1(X_i), \theta)$ and $L(\delta_2(Y_i), \theta)$ then reduces the variability of the approximation of $R(\delta_1, \theta) - R(\delta_2, \theta)$.

Before we continue with the development in this section, we pause to make 
two elementary remarks that should be observed in any simulation compari- 
son. 

(i) First, the same sample $(X_1, \ldots, X_m)$ should be used in the evaluation of $R(\delta_1, \theta)$ and of $R(\delta_2, \theta)$. This repeated use of a single sample greatly improves the precision of the estimated difference $R(\delta_1, \theta) - R(\delta_2, \theta)$, as shown by the comparison of the variances of $\hat R(\delta_1, \theta) - \hat R(\delta_2, \theta)$ and of

    $\frac{1}{m} \sum_{i=1}^{m} \{ L(\delta_1(X_i), \theta) - L(\delta_2(X_i), \theta) \}$ .

(ii) Second, the same sample should be used for the comparison of risks for every value of $\theta$. Although this sounds like an absurd recommendation since the sample $(X_1, \ldots, X_m)$ is usually generated from a distribution depending on $\theta$, it is often the case that the same uniform sample can be used for the generation of the $X_i$'s for every value of $\theta$. Also, in many cases, there exists a transformation $M_\theta$ on $\mathcal{X}$ such that if $X^0 \sim f(X|\theta_0)$, $M_\theta X^0 \sim f(X|\theta)$. A single sample $(X_1^0, \ldots, X_m^0)$ from $f(X|\theta_0)$ is then sufficient to produce a sample from $f(X|\theta)$ by the transform $M_\theta$. (This second remark is somewhat tangential to the theme of this section; however, it brings significant improvement in the practical implementation of Monte Carlo methods.)
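Remark (i) can be illustrated numerically. The sketch below is a toy setting of our own choosing, not an example from the text: $X \sim \mathcal{N}(\theta, 1)$, squared error loss, and the two hypothetical estimators $\delta_1(x) = x$ and $\delta_2(x) = 0.9x$. It compares the variability of the estimated risk difference when the two risks share a sample and when they use independent samples.

```python
import numpy as np


def squared_loss(estimate, theta):
    return (estimate - theta) ** 2


def risk_difference(theta, m, rng, common=True):
    """Monte Carlo estimate of R(d1, theta) - R(d2, theta), d1(x) = x, d2(x) = 0.9 x."""
    x = theta + rng.standard_normal(m)                     # X_i ~ N(theta, 1)
    y = x if common else theta + rng.standard_normal(m)    # shared or fresh sample
    return squared_loss(x, theta).mean() - squared_loss(0.9 * y, theta).mean()


def variance_of_difference(theta, m, reps, common, seed=0):
    """Empirical variance of the estimated risk difference over reps replications."""
    rng = np.random.default_rng(seed)
    diffs = [risk_difference(theta, m, rng, common) for _ in range(reps)]
    return np.var(diffs)


if __name__ == "__main__":
    print(variance_of_difference(1.0, 100, 300, common=True))
    print(variance_of_difference(1.0, 100, 300, common=False))
```

In this location model the draws are written as $\theta$ plus standard normal noise, so the same noise could also be reused for every $\theta$, in the spirit of remark (ii).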

The variance reduction associated with the conservation of the underlying uniform sample is obvious in the graphs of the resulting risk functions, which then miss the irregular peaks of graphs obtained with independent samples and allow for an easier comparison of estimators. See, for instance, the graphs in Figure 3.4, which are based on samples generated independently for each value of $\lambda$. By comparison, an evaluation based on a single sample corresponding to $\lambda = 1$ would give a constant risk in the exponential case.

Example 4.12. James-Stein estimation. In the case $X \sim \mathcal{N}_p(\theta, I_p)$, the transform is the location shift $M_\theta X = X + \theta - \theta_0$. When studying positive-part James-Stein estimators

    $\delta_a(x) = \left(1 - \frac{a}{\|x\|^2}\right)^{+} x$ ,    $0 \le a \le 2(p - 2)$

(see Robert 2001, Chapter 2, for a motivation), the squared error risk of $\delta_a$ can be computed "explicitly," but the resulting expression involves several special functions (Robert 1988) and the approximation of the risks by simulation is much more helpful in comparing these estimators. Figure 4.10 illustrates this comparison in the case $p = 5$ and exhibits a crossing phenomenon for the risk functions in the same region; however, as shown by the inset, the crossing point for the risks of $\delta_a$ and $\delta_c$ depends on $(a, c)$. ||







fSO OOO iterations) 

Fig. 4.10. Approximate squared error risks of truncated James-Stein estimators for a normal distribution $\mathcal{N}_5(\theta, I_5)$, as a function of $\|\theta\|$. The inset gives a magnification of the intersection zone for the risk functions.



In a more general setup, creating a strong enough correlation between $\delta_1$ and $\delta_2$ is rarely so simple, and the quest for correlation can result in an increase in the conception and simulation burdens, which may even have a negative overall effect on the efficiency of the analysis. Indeed, to use the same uniform sample for the generation of variables distributed from $f_1$ and $f_2$ in (4.16) is only possible when there exists a simple transformation from $f_1$ to $f_2$. For instance, if $f_1$ or $f_2$ must be simulated by Accept-Reject methods, the use of a random number of uniform variables prevents the use of a common sample.

The method of antithetic variables is based on the same idea that higher efficiency can be brought about by correlation. Given two samples $(X_1, \ldots, X_m)$ and $(Y_1, \ldots, Y_m)$ from $f$ used for the estimation of

    $\mathfrak{I} = \int_{\mathbb{R}} h(x) f(x)\, dx$ ,

the estimator

(4.17)    $\frac{1}{2m} \sum_{i=1}^{m} [h(X_i) + h(Y_i)]$

is more efficient than an estimator based on an iid sample of size $2m$ if the variables $h(X_i)$ and $h(Y_i)$ are negatively correlated. In this setting, the $Y_i$'s are the antithetic variables, and it remains to develop a method for generating these variables in an optimal (or, at least, useful) way. However, the correlation between $h(X_i)$ and $h(Y_i)$ depends both on the pair $(X_i, Y_i)$ and on the function $h$. (For instance, if $h$ is even, $X_i$ has mean 0, and $X_i = -Y_i$, $X_i$ and $Y_i$ are negatively correlated, but $h(X_i) = h(Y_i)$.) A solution proposed in Rubinstein (1981) is to use the uniform variables $U_i$ to generate the $X_i$'s and the variables $1 - U_i$ to generate the $Y_i$'s. The argument goes as follows: If $H = h \circ F^-$, $X_i = F^-(U_i)$, and $Y_i = F^-(1 - U_i)$, then $h(X_i)$ and $h(Y_i)$ are negatively correlated when $H$ is a monotone function. Again, such a constraint is often difficult to verify and, moreover, the technique only applies for direct transforms of uniform variables, thus excluding the Accept-Reject methods.
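The $(U_i, 1 - U_i)$ construction can be tried on a case where $H = h \circ F^-$ is monotone. The sketch below assumes, purely for illustration, an $\mathcal{E}xp(1)$ target with $F^-(u) = -\log(1 - u)$ and $h(x) = x$; the function names are hypothetical and the uniforms are clipped away from 0 and 1 only to keep both quantiles finite.

```python
import numpy as np


def exp_quantile(u):
    """Generalized inverse F^- of the Exp(1) cdf."""
    return -np.log1p(-u)


def antithetic_estimate(h, quantile, m, rng):
    """Estimator (4.17) with X_i = F^-(U_i) and Y_i = F^-(1 - U_i)."""
    u = np.clip(rng.uniform(size=m), 1e-12, 1.0 - 1e-12)
    return 0.5 * np.mean(h(quantile(u)) + h(quantile(1.0 - u)))


def iid_estimate(h, quantile, m, rng):
    """Standard average based on an iid sample of size 2m."""
    u = np.clip(rng.uniform(size=2 * m), 1e-12, 1.0 - 1e-12)
    return np.mean(h(quantile(u)))


def compare(h, m=100, reps=400, seed=0):
    """Empirical variances of the antithetic and iid estimators over reps runs."""
    rng = np.random.default_rng(seed)
    anti = [antithetic_estimate(h, exp_quantile, m, rng) for _ in range(reps)]
    iid = [iid_estimate(h, exp_quantile, m, rng) for _ in range(reps)]
    return np.var(anti), np.var(iid)


if __name__ == "__main__":
    print(compare(lambda x: x))
```

Both estimators use $2m$ evaluations of $h$, so the comparison is at equal cost in function evaluations; replacing $h$ by a non-monotone function is a quick way to see the gain disappear.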

Geweke (1988) proposed the implementation of an inversion at the level of the $X_i$'s by taking $Y_i = 2\mu - X_i$ when $f$ is symmetric around $\mu$. With some additional conditions on the function $h$, the improvement brought by

    $\frac{1}{2m} \sum_{i=1}^{m} [h(X_i) + h(2\mu - X_i)]$

upon

    $\frac{1}{2m} \sum_{i=1}^{2m} h(X_i)$

is quite substantial for large sample sizes $m$. Empirical extensions of this approach can then be used in cases where $f$ is not symmetric, by replacing $\mu$ with the mode of $f$ or the median of the associated distribution. Moreover, if $f$ is unknown or, more importantly, if $\mu$ is unknown, $\mu$ can be estimated from a first sample (but caution is advised!). More general group actions can also be considered, as in Kong et al. (2003), where the authors replace the standard average by an average (over $i$) of the average of the $h(g x_i)$ (over the transformations $g$).

Example 4.13. (Continuation of Example 3.3) Assume, for the sake of illustration, that the noncentral chi squared variables $\|X_i\|^2$ are simulated from normal random variables $X_i \sim \mathcal{N}_p(\theta, I_p)$. We can create negative correlation by using $Y_i = 2\theta - X_i$, which has a correlation of $-1$ with $X_i$, to produce a second sample, $\|Y_i\|^2$. However, the negative correlation does not necessarily transfer to the pairs $(h(\|X_i\|^2), h(\|Y_i\|^2))$. Figure 4.11 illustrates the behavior of (4.17) for

    $h_1(\|x\|^2) = \|x\|^2$  and  $h_2(\|x\|^2) = \mathbb{I}_{\|x\|^2 < \|\theta\|^2 + p}$ ,

when $m = 500$ and $p = 4$, compared to an estimator based on an iid sample of size $2m$. As shown by the graphs in Figure 4.11, although the correlation between $h_1(\|X_i\|^2)$ and $h_1(\|Y_i\|^2)$ is actually positive for small values of $\|\theta\|^2$, the improvement brought by (4.17) over the standard average is quite impressive in the case of $h_1$. The setting is less clear for $h_2$, but the variance of the terms of (4.17) is much smaller than its independent counterpart. ||



Fig. 4.11. Average of the antithetic estimator (4.17) (solid lines) against the average of a standard iid estimate (dots) for the estimation of $\mathbb{E}[h_1(\|X\|^2)]$ (upper left) and $\mathbb{E}[h_2(\|X\|^2)]$ (lower left), along with the empirical variance of $h_1(X_i) + h_1(Y_i)$ (upper center) and $h_2(X_i) + h_2(Y_i)$ (lower center), and the correlation between $h_1(\|X_i\|^2)$ and $h_1(\|Y_i\|^2)$ (upper right) and between $h_2(\|X_i\|^2)$ and $h_2(\|Y_i\|^2)$ (lower right), for $m = 500$ and $p = 4$. The horizontal axis is scaled in terms of $\|\theta\|$ and the values in the upper left graph are divided by the true expectation, $\|\theta\|^2 + p$, and the values in the upper central graph are divided by $8\|\theta\|^2 + 4p$.

4.4.2 Control Variates



In some settings, there exist functions $h_0$ whose mean under $f$ is known. For instance, if $f$ is symmetric around $\mu$, the mean of $h_0(X) = \mathbb{I}_{X > \mu}$ is 1/2. We also saw a more general example in the case of Riemann sums with known density $f$, with a convergent estimator of 1. This additional information can reduce the variance of an estimator of $\mathfrak{I} = \int h(x) f(x)\, dx$ in the following way. If $\delta_1$ is an estimator of $\mathfrak{I}$ and $\delta_3$ an unbiased estimator of $\mathbb{E}_f[h_0(X)]$, consider the weighted estimator $\delta_2 = \delta_1 + \beta(\delta_3 - \mathbb{E}_f[h_0(X)])$. The estimators $\delta_1$ and $\delta_2$ have the same mean and

    $\mathrm{var}(\delta_2) = \mathrm{var}(\delta_1) + \beta^2\, \mathrm{var}(\delta_3) + 2\beta\, \mathrm{cov}(\delta_1, \delta_3)$ .

For the optimal choice

    $\beta^* = -\frac{\mathrm{cov}(\delta_1, \delta_3)}{\mathrm{var}(\delta_3)}$ ,

we have

    $\mathrm{var}(\delta_2) = (1 - \rho_{13}^2)\, \mathrm{var}(\delta_1)$ ,

$\rho_{13}$ being the correlation coefficient between $\delta_1$ and $\delta_3$, so the control variate strategy will result in decreased variance. In particular, if

    $\delta_1 = \frac{1}{m} \sum_{i=1}^{m} h(X_i)$  and  $\delta_3 = \frac{1}{m} \sum_{i=1}^{m} h_0(X_i)$ ,

the control variate estimator is

    $\delta_2 = \frac{1}{m} \sum_{i=1}^{m} \left( h(X_i) + \beta^* h_0(X_i) \right) - \beta^*\, \mathbb{E}_f[h_0(X)]$ ,

with $\beta^* = -\mathrm{cov}(h(X), h_0(X)) / \mathrm{var}(h_0(X))$. Note that this construction is only formal since it requires the computation of $\beta^*$. An incorrect choice of $\beta$ may lead to an increased variance; that is, $\mathrm{var}(\delta_2) > \mathrm{var}(\delta_1)$. (However, in practice, the sign of $\beta^*$ can be evaluated by a regression of the $h(x_i)$'s over the $h_0(x_i)$'s. More generally, functions with known expectations can be used as side controls in convergence diagnoses.)
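The optimal weight and the resulting variance follow from a one-line minimization of the quadratic in $\beta$; the worked step below only spells out what the variance identity already implies.

```latex
\begin{align*}
\frac{\partial}{\partial\beta}\,\mathrm{var}(\delta_2)
  &= 2\beta\,\mathrm{var}(\delta_3) + 2\,\mathrm{cov}(\delta_1,\delta_3) = 0
  \quad\Longrightarrow\quad
  \beta^* = -\frac{\mathrm{cov}(\delta_1,\delta_3)}{\mathrm{var}(\delta_3)}, \\
\mathrm{var}(\delta_2)\big|_{\beta=\beta^*}
  &= \mathrm{var}(\delta_1) - \frac{\mathrm{cov}(\delta_1,\delta_3)^2}{\mathrm{var}(\delta_3)}
   = \left(1-\rho_{13}^2\right)\mathrm{var}(\delta_1), \\
\mathrm{var}(\delta_2) < \mathrm{var}(\delta_1)
  &\iff \beta^2\,\mathrm{var}(\delta_3) + 2\beta\,\mathrm{cov}(\delta_1,\delta_3) < 0
  \iff \beta \text{ lies strictly between } 0 \text{ and } 2\beta^* .
\end{align*}
```

The last line makes the warning about an incorrect $\beta$ precise: any weight with the sign of $\beta^*$ and magnitude below $2|\beta^*|$ still reduces the variance, while a weight outside that interval increases it.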

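Example 4.14. Control variate integration. Let $X \sim f$, and suppose that we want to evaluate

    $P(X > a) = \int_a^\infty f(x)\, dx$ .

The natural place to start is with $\delta_1 = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i > a)$, where the $X_i$'s are iid from $f$.

Suppose now that $f$ is symmetric or, more generally, that for some parameter $\mu$ we know the value of $P(X > \mu)$ (where we assume that $a > \mu$). We can then take $\delta_3 = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i > \mu)$ and form the control variate estimator

    $\delta_2 = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i > a) + \beta \left( \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i > \mu) - P(X > \mu) \right)$ .

Since $\mathrm{var}(\delta_2) = \mathrm{var}(\delta_1) + \beta^2\, \mathrm{var}(\delta_3) + 2\beta\, \mathrm{cov}(\delta_1, \delta_3)$ and

(4.18)    $\mathrm{cov}(\delta_1, \delta_3) = \frac{1}{n} P(X > a)[1 - P(X > \mu)]$ ,
          $\mathrm{var}(\delta_3) = \frac{1}{n} P(X > \mu)[1 - P(X > \mu)]$ ,

it follows that $\delta_2$ will be an improvement over $\delta_1$ if

    $\beta < 0$  and  $|\beta| < \frac{2\, \mathrm{cov}(\delta_1, \delta_3)}{\mathrm{var}(\delta_3)} = \frac{2\, P(X > a)}{P(X > \mu)}$ .

If $P(X > \mu) = 1/2$ and we have some idea of the value of $P(X > a)$, we can choose an appropriate value for $\beta$ (see Problems 4.15 and 4.18). ||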
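A possible implementation of the estimator of Example 4.14 is sketched below. It departs from the example in one labeled respect: instead of fixing $\beta$ from prior knowledge of $P(X > a)$, it plugs in the sample version of $\beta^*$ (an assumption on our part, in the spirit of the regression remark), which introduces a small bias. The function name and the standard normal test case are illustrative.

```python
import numpy as np


def tail_probability(x, a, mu, p_mu):
    """Plain and control-variate estimates of P(X > a), given P(X > mu) = p_mu."""
    target = (x > a).astype(float)      # I(X_i > a)
    control = (x > mu).astype(float)    # I(X_i > mu), with known mean p_mu
    delta1 = target.mean()
    delta3 = control.mean()
    # Sample version of beta* = -cov(delta1, delta3) / var(delta3) (assumed plug-in).
    cov = np.cov(target, control)
    beta = -cov[0, 1] / cov[1, 1]
    delta2 = delta1 + beta * (delta3 - p_mu)
    return delta1, delta2, beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Symmetric case: f = N(0, 1), mu = 0, P(X > mu) = 1/2.
    print(tail_probability(rng.standard_normal(200), a=0.5, mu=0.0, p_mu=0.5))
```

Repeating the call over many independent samples and comparing the empirical variances of the first two outputs gives a direct check of the improvement condition stated in the example.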



Example 4.15. Logistic regression. Consider the logistic regression model introduced in Example 1.13,

    $P(Y_i = 1) = \exp(x_i^t \theta) / \{1 + \exp(x_i^t \theta)\}$ .

The likelihood associated with a sample $((x_1, Y_1), \ldots, (x_n, Y_n))$ can be written

    $\exp\left( \theta^t \sum_i Y_i x_i \right) \prod_{i=1}^{n} \{1 + \exp(x_i^t \theta)\}^{-1}$ .

When $\pi(\theta)$ is a conjugate prior (see Note 1.6.1),

(4.19)    $\pi(\theta | \zeta, \lambda) \propto \exp(\theta^t \zeta) \prod_{i=1}^{n} \{1 + \exp(x_i^t \theta)\}^{-\lambda}$ ,    $\lambda > 0$ ,

the posterior distribution of $\theta$ is of the same form, with $(\sum_i Y_i x_i + \zeta, \lambda + 1)$ replacing $(\zeta, \lambda)$.

The expectation $\mathbb{E}^\pi[\theta \mid \sum_i Y_i x_i + \zeta, \lambda + 1]$ is derived from variables $\theta_j$ $(1 \le j \le m)$ generated from (4.19). Since the logistic distribution is in an exponential family, the following holds (see Brown 1986 or Robert 2001, Section 3.3):

    $\mathbb{E}_\theta\left[ \sum_i Y_i x_i \right] = n \nabla\psi(\theta)$

and

    $\mathbb{E}^\pi\left[ \nabla\psi(\theta) \,\Big|\, \sum_i Y_i x_i + \zeta, \lambda + 1 \right] = \frac{\sum_i Y_i x_i + \zeta}{n(\lambda + 1)}$ .

Therefore, the posterior expectation of the function

    $n \nabla\psi(\theta) = \sum_{i=1}^{n} \frac{\exp(x_i^t \theta)}{1 + \exp(x_i^t \theta)}\, x_i$

is known and equal to $(\sum_i Y_i x_i + \zeta)/(\lambda + 1)$ under the prior distribution $\pi(\theta | \zeta, \lambda)$. Unfortunately, a control variate version of

    $\delta_1 = \frac{1}{m} \sum_{j=1}^{m} \theta_j$

is not available since the optimal constant $\beta^*$ (or even its sign) cannot be evaluated, except by the regression of the $\theta_j$'s upon the

    $\sum_i \frac{\exp(x_i^t \theta_j)}{1 + \exp(x_i^t \theta_j)}\, x_i$ .

Thus the fact that the posterior mean of $\nabla\psi(\theta)$ is known does not help us to establish a control variate estimator. This information can be used in a more informal way to study convergence of $\delta_1$ (see, for instance, Robert 1993). ||



In conclusion, the technique of control variates is manageable only in very specific cases: the control function $h_0$ must be available, as well as the optimal weight $\beta^*$. See, however, Brooks and Gelman (1998b) for a general approach based on the score function (whose expectation is null under general regularity conditions).



4.5 Problems 



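4.1 (Chen and Shao 1997) As mentioned, normalizing constants are superfluous in Bayesian inference except in the case when several models are considered at once (as in the computation of Bayes factors). In such cases, where $\pi_1(\theta) = \tilde\pi_1(\theta)/c_1$ and $\pi_2(\theta) = \tilde\pi_2(\theta)/c_2$, and only $\tilde\pi_1$ and $\tilde\pi_2$ are known, the quantity to approximate is $\varrho = c_1/c_2$ or $\xi = \log(c_1/c_2)$.

(a) Show that the ratio $\varrho$ can be approximated by

    $\frac{1}{n} \sum_{i=1}^{n} \frac{\tilde\pi_1(\theta_i)}{\tilde\pi_2(\theta_i)}$ ,    $\theta_1, \ldots, \theta_n \sim \pi_2$ .

(Hint: Use an importance sampling argument.)

(b) Show that

    $\frac{\int \tilde\pi_1(\theta)\, \alpha(\theta)\, \pi_2(\theta)\, d\theta}{\int \tilde\pi_2(\theta)\, \alpha(\theta)\, \pi_1(\theta)\, d\theta} = \frac{c_1}{c_2}$

holds for every function $\alpha(\theta)$ such that both integrals are finite.

(c) Deduce that

    $\frac{\frac{1}{n_2} \sum_{i=1}^{n_2} \tilde\pi_1(\theta_{2i})\, \alpha(\theta_{2i})}{\frac{1}{n_1} \sum_{i=1}^{n_1} \tilde\pi_2(\theta_{1i})\, \alpha(\theta_{1i})}$ ,

with $\theta_{1i} \sim \pi_1$ and $\theta_{2i} \sim \pi_2$, is a convergent estimator of $\varrho$.

(d) Show that part (b) covers the case of the Newton and Raftery (1994) representation

    $\frac{c_1}{c_2} = \frac{\mathbb{E}^{\pi_2}[\tilde\pi_2(\theta)^{-1}]}{\mathbb{E}^{\pi_1}[\tilde\pi_1(\theta)^{-1}]}$ .

(e) Show that the optimal choice (in terms of mean square error) of $\alpha$ in part (c) is

    $\alpha(\theta) = c\, \frac{n_1 + n_2}{n_1 \pi_1(\theta) + n_2 \pi_2(\theta)}$ ,

where $c$ is a constant. (Note: See Meng and Wong 1996.)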

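4.2 (Continuation of Problem 4.1) When the priors $\pi_1$ and $\pi_2$ belong to a parameterized family (that is, $\pi_i(\theta) = \pi(\theta | \lambda_i)$), the corresponding constants are denoted by $c(\lambda_i)$.

(a) Verify the identity

    $-\log\left( \frac{c(\lambda_1)}{c(\lambda_2)} \right) = \mathbb{E}\left[ \frac{U(\theta, \lambda)}{\pi(\lambda)} \right]$ ,

where

    $U(\theta, \lambda) = \frac{d}{d\lambda} \log(\tilde\pi(\theta | \lambda))$

and $\pi(\lambda)$ is an arbitrary distribution on $\lambda$.

(b) Show that $\xi$ can be estimated with the bridge estimator of Gelman and Meng (1998),

    $\hat\xi = \frac{1}{n} \sum_{i=1}^{n} \frac{U(\theta_i, \lambda_i)}{\pi(\lambda_i)}$ ,

when the $(\theta_i, \lambda_i)$'s are simulated from the joint distribution induced by $\pi(\lambda)$ and $\pi(\theta | \lambda_i)$.

(c) Show that the minimum variance estimator of $\xi$ is based on

    $\pi(\lambda) \propto \sqrt{\mathbb{E}_\lambda[U^2(\theta, \lambda)]}$

and examine whether this solution gives the Jeffreys prior.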

4.3 For the situation of Example 4.2:

(a) Show that $\delta_m^\pi(x) \to \delta^\pi(x)$ as $m \to \infty$.

(b) Show that the Central Limit Theorem can be applied to $\delta_m^\pi(x)$.

(c) Generate random variables $\theta_1, \ldots, \theta_m \sim \mathcal{N}(x, 1)$ and calculate $\delta_m^\pi(x)$ for $x = 0, 1, 4$. Use the Central Limit Theorem to construct a measure of accuracy of your calculation.

4.4 Verify equation (4.10), that is, show that

    $\mathbb{E}[\exp(-X^2) | y] = \frac{1}{\sqrt{2\pi\sigma^2/y}} \int \exp\{-x^2\}\, \exp\{-(x - \mu)^2 y / 2\sigma^2\}\, dx = \frac{1}{\sqrt{2\sigma^2/y + 1}} \exp\left\{ -\frac{\mu^2}{1 + 2\sigma^2/y} \right\}$ ,

by completing the square in the exponent to evaluate the integral.

4.5 In simulation from mixture densities, it is always possible to set up a Rao-Blackwellized alternative to the empirical average.

(a) Show that, if $f(x) = \int g(x|y) h(y)\, dy$, then

    $\frac{1}{M} \sum_{i=1}^{M} X_i$ ,  $X_i \sim f$ ,   and   $\frac{1}{M} \sum_{i=1}^{M} \mathbb{E}[X | Y_i]$ ,  $Y_i \sim h$ ,

each converge to $\mathbb{E}_f(X)$.

(b) For each of the following cases, generate random variables $X_i$ and $Y_i$, and compare the empirical average and Rao-Blackwellized estimator of $\mathbb{E}_f(X)$ and $\mathrm{var}_f(X)$:

a) $X|y \sim \mathcal{P}(y)$, $Y \sim \mathcal{G}a(a, b)$ ($X$ is negative binomial);

b) $X|y \sim \mathcal{N}(0, y)$, $Y \sim \mathcal{G}a(a, b)$ ($X$ is a generalized $t$);

c) $X|y \sim \mathcal{B}in(y)$, $Y \sim \mathcal{B}e(a, b)$ ($X$ is beta-binomial).

4.6 For the estimator $\delta_2$ of Section 4.2:

(a) Verify the expression (4.11) for $\rho_i$.

(b) Verify the recursion (4.12).

(c) Prove Proposition 4.6. (Hint: Show that $\mathbb{E}[\delta_2] = \mathbb{E}[\delta_1]$ and apply the Rao-Blackwell Theorem.)

4.7 Referring to Lemma 4.3, show how to use (4.5) to derive the expression for the 
asymptotic variance of SJ^. 

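4.8 Given an Accept-Reject algorithm based on $(f, g, \rho)$, we denote by

    $b(y_j) = \frac{(1 - \rho) f(y_j)}{g(y_j) - \rho f(y_j)}$

the importance sampling weight of the rejected variables $(Y_1, \ldots, Y_t)$, and by $(X_1, \ldots, X_n)$ the accepted variables.

(a) Show that the estimator

    $\delta_1 = \frac{n}{n+t}\, \delta^{AR} + \frac{t}{n+t}\, \delta_0$ ,

with

    $\delta_0 = \frac{1}{t} \sum_{j=1}^{t} b(Y_j) h(Y_j)$

and

    $\delta^{AR} = \frac{1}{n} \sum_{i=1}^{n} h(X_i)$ ,

does not uniformly dominate $\delta^{AR}$. (Hint: Consider the constant functions.)

(b) Show that

    $\delta_{2w} = \frac{n}{n+t}\, \delta^{AR} + \frac{t}{n+t}\, \frac{\sum_{j=1}^{t} b(Y_j) h(Y_j)}{\sum_{j=1}^{t} b(Y_j)}$

is asymptotically equivalent to $\delta_1$ in terms of bias and variance.

(c) Deduce that $\delta_{2w}$ asymptotically dominates $\delta^{AR}$ if (4.20) holds.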

4.9 Referring to Section 4.1.2:

(a) Show that $\mathrm{cov}(\bar X_k, \bar X_{k'}) = \sigma^2 / \max\{k, k'\}$, regardless of the distribution of $X_i$.

(b) Verify that $\Sigma^{-1}$ is given by (4.7) for $n = 3, 10, 25$.

(c) Establish a recursion relation for calculating the elements $a_{ij}$ of $\Sigma^{-1}$:

    $a_{ij} = 2i^2$   if $i = j < n$,
    $a_{ij} = n^2$    if $i = j = n$,
    $a_{ij} = -ij$    if $|i - j| = 1$,
    $a_{ij} = 0$      otherwise.

(d) If we denote by $\Sigma_k^{-1}$ the inverse matrix corresponding to $(\bar X_1, \bar X_2, \ldots, \bar X_k)$, show that to get $\Sigma_{k+1}^{-1}$ you only have to change one element, then add one row and one column to $\Sigma_k^{-1}$.

4.10 (a) Establish (4.8) by showing that (i) = n and (ii) = n. 

(b) Show that a reasonable approximation for $d_n$ is $d_n \approx n + 2\sqrt{2n}$. (Hint: Consider the mean and variance of the $\chi^2_n$ distribution.)

4.11 Referring to Example 4.2: 

(a) Compare a running mean plot with ordinary univariate normal error bars 
to the variance assessment of (4.9). Discuss advantages and disadvantages. 

(b) Compare a running mean plot with empirical error bars to the variance 
assessment of (4.9). Use 500 estimates to calculate 95% error bars. Discuss 
advantages and disadvantages. 

(c) Repeat parts (a) and (b) for h{x) = and thus assess the estimate of the 
posterior variance of 0. 

4.12 In the setting of Section 4.3, examine whether the substitution of 



n — 1 
i=i 



2 



into 

n— 1 
i=l 



improves the speed of convergence. (Hint: Examine the influence of the remainder terms

    $\int_{-\infty}^{X_{(1)}} h(x) f(x)\, dx$  and  $\int_{X_{(n)}}^{+\infty} h(x) f(x)\, dx$.)

4.13 Show that it is always possible to express an integral $\mathfrak{I}$ as $\mathfrak{I} = \int_0^1 h(y) f(y)\, dy$, where the integration is over $(0, 1)$ and $h$ and $f$ are transforms of the original functions.

4.14 In this problem we will prove Proposition 4.9.

(a) Define $U_{-1} = 0$ and $U_{m+1} = 1$, and show that $\delta$ can be written

    $\delta = \sum_{i=-1}^{m} h(U_i)(U_{i+1} - U_i) = \sum_{i=-1}^{m} \int_{U_i}^{U_{i+1}} h(U_i)\, du$ ,

and thus the difference $(\mathfrak{I} - \delta)$ can be written as $\sum_{i=-1}^{m} \int_{U_i}^{U_{i+1}} (h(u) - h(U_i))\, du$.

(b) Show that the first-order expansion $h(u) = h(U_i) + h'(\zeta)(u - U_i)$, $\zeta \in [U_i, u]$, implies that $|h(u) - h(U_i)| < c\,(u - U_i)$, with $c = \sup_{[0,1]} |h'(x)|$.

(c) Show that

    $\mathrm{var}(\delta) \le \mathbb{E}[(\mathfrak{I} - \delta)^2] \le c^2 \left\{ (m + 2)\, \mathbb{E}[Z_i^4] + (m + 1)(m + 2)\, \mathbb{E}[Z_i^2 Z_j^2] \right\}$ ,

where $Z_i = U_{i+1} - U_i$.

(d) The $U_i$'s are the order statistics of a uniform distribution. Show that (i) the variables $Z_i$ are jointly distributed according to a Dirichlet distribution $\mathcal{D}_m(1, \ldots, 1)$, with $Z_i \sim \mathcal{B}e(1, m)$ and $(Z_i, Z_j, 1 - Z_i - Z_j) \sim \mathcal{D}_3(1, 1, m - 1)$, (ii) $\mathbb{E}[Z_i^4] = m \int_0^1 z^4 (1 - z)^{m-1}\, dz = \frac{24\, m!}{(m+4)!}$ and $\mathbb{E}[Z_i^2 Z_j^2] = \frac{4\, m!}{(m+4)!}$.

(e) Finally, establish that

    $c^2 \left\{ \frac{24}{(m + 4)(m + 3)(m + 1)} + \frac{4}{(m + 4)(m + 3)} \right\} = O(m^{-2})$ ,

proving the proposition.

4.15 For the situation of Example 4.14:

(a) Verify (4.18).

(b) Verify the conditions on $\beta$ in order for $\delta_2$ to improve on $\delta_1$.

(c) For $f$ the density of $\mathcal{N}(0, 1)$, find $P(X > a)$ for $a = 3, 5, 7$.

(d) For $f$ the density of $\mathcal{T}_5$, find $P(X > a)$ for $a = 3, 5, 7$.

(e) For $f$ the density of $\mathcal{T}_5$, find $a$ such that $P(X > a) = .01, .001, .0001$.

4.16 A naive way to implement the antithetic variable scheme is to use both $U$ and $(1 - U)$ in an inversion simulation. Examine empirically whether this method leads to variance reduction for the distributions (i) $f_1(x) = 1/\pi(1 + x^2)$, (ii) $f_2(x) = \frac{1}{2} e^{-|x|}$, (iii) $f_3(x) = e^{-x}\, \mathbb{I}_{x > 0}$, (iv) $f_4(x) = \frac{2}{\pi\sqrt{3}} (1 + x^2/3)^{-2}$, and (v) $f_5(x) = 2x^{-3}\, \mathbb{I}_{x > 1}$. Examine variance reductions of the mean, second moment, median, and 75th percentile.

To calculate the weights for the Rao-Blackwellized estimator of Section 4.2, it is necessary to derive properties of the distribution of the random variables in the Accept-Reject algorithm [A.4]. The following problem is a rather straightforward exercise in distribution theory and is only made complicated by the stopping rule of the Accept-Reject algorithm.

4.17 This problem looks at the performance of a termwise Rao-Blackwellized esti- 
mator. Casella and Robert (1998) established that such an estimator does not 
sacrifice much performance over the full Rao-Blackwellized estimator of Propo- 
sition 4.6. Given a sample (Vi, . . . , Yn) produced by an Accept-Reject algorithm 
to accept m values, based on (f,g,M): 

(a) Show that 



with 






1 



n — m 



n — 1 

KYn) + Y b(Yi)h{Yi) 

i=l 




m{g{Yi)-pm)) 

p)f{Yi)J 
(b) If Sn = show that



* = g 

asymptotically dominates the usual Monte Carlo approximation, condi- 
tional on the number of rejected variables m under quadratic loss. {Hint: 
Show that the sum of the weights Sn can be replaced by (n — m — 1) in 5 
and assume Ef[h{X)] = 0.) 

4.18 Strawderman (1996) adapted the control variate scheme to the Accept-Reject 
algorithm. When Vi, . . . , Y}v is the sample produced by an Accept-Reject algo- 
rithm based on g, let m denote the density 






n — 1 



n — 1 1 — p 



when N = n and p = 
(a) Show that 



M' 



3 = 



J h{x)f{x)dx = En 



E 



' h{Y)f{Y) 

m{Y) 




where m is the marginal density of Yi (see Problem 3.29). 
(b) Show that for any function c(-) and some constant j3^ 



3 = PE[c(Y)] + E 



' h{Y)f{Y) 

m{Y) 



- 0c{Y) 



(c) Setting d{y) — h{y) f {y) / m{y) , show that the optimal choice of (5 is 



I3* = cov[d(y), c(y)]/var[c(V)]. 

(d) Examine choices of c for which the optimal f3 can be constructed and, thus, 
where the control variate method applies. 

{Note: Strawderman 1996 suggests estimating (3 with /3, the estimated slope of 
the regression of d{yi) on c{yi), z = 1, 2, . . . , n — 1.) 

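4.19 For $t \sim \mathcal{G}eo(\rho)$, show that

    $\mathbb{E}[t^{-1}] = -\frac{\rho \log \rho}{1 - \rho}$ ,    $\mathbb{E}[t^{-2}] = \frac{\rho\, \mathrm{Li}(1 - \rho)}{1 - \rho}$ ,

where $\mathrm{Li}(x)$ is the dilog function (see Note 4.6.2).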

4.20 (Continuation of Problem 4.19) If V ~ J\feg{n,p), show that 



E[({N-l){N-2))-^] = 



{n — l){n — 2) 



EffAT - 1)“^1 = - p) 

‘ (n-l)2 
4.21 (Continuation of Problem 4.19) If $\mathrm{Li}$ is the dilog function, show that

    $\lim_{\rho \to 1} \frac{\mathrm{Li}(1 - \rho)}{1 - \rho} = 1$  and  $\lim_{\rho \to 1} \log(\rho)\, \mathrm{Li}(1 - \rho) = 0$ .

Deduce that the domination corresponding to (4.20) occurs on an interval of the form $[\rho_0, 1]$.

4.22 Given an integral

    ℑ = ∫_X h(x) f(x) dx

to be evaluated by simulation, compare the usual Monte Carlo estimator

    δ_1 = (1/(np)) ∑_{i=1}^{np} h(X_i)

based on an iid sample (X_1, ..., X_{np}) with a stratified estimator (see Note 4.6.3)

    δ_2 = ∑_{j=1}^{p} (ρ_j / n) ∑_{i=1}^{n} h(Y_i^j),

where ρ_j = ∫_{X_j} f(x) dx, X = ∪_{j=1}^{p} X_j, and (Y_1^j, ..., Y_n^j) is a sample
from f restricted to X_j.

Show that δ_2 does not bring any improvement if the ρ_j's are unknown and must
be estimated.



4.6 Notes 

4.6.1 Monitoring Importance Sampling Convergence

With reference to convergence control for simulation methods, importance sampling 
methods can be implemented in a monitored way; that is, in parallel with other evalu- 
ation methods. These can be based on alternative instrumental distributions or other 
techniques (standard Monte Carlo, Markov chain Monte Carlo, Riemann sums, Rao- 
Blackwellization, etc.). The respective samples then provide separate evaluations of 
E[h(X)], through (3.8), (3.11), or yet another estimator (as in Section 3.3.3), and the
convergence criterion is to stop when most estimators are close enough. Obviously,
this empirical method is not completely foolproof, but it generally prevents
pseudo-convergences when the instrumental distributions are sufficiently different. On the
other hand, this approach is rather conservative, as it is only as fast as the slowest
estimator. However, it may also point out instrumental distributions with variance
problems. From a computational point of view, an efficient implementation of this
control method relies on the use of parallel programming in order to weight each
distribution more equitably, so that a distribution f of larger variance, compared
with another distribution g, may compensate for this drawback by a lower computation
time, thus producing a larger sample in the same time.^

^ This feature does not necessarily require a truly parallel implementation, since it 
can be reproduced by the cyclic allocation of uniform random variables to each 
of the distributions involved. 
4.6.2 Accept-Reject with Loose Bounds

For the Accept-Reject algorithm [A.4], some interesting results emerge if we assume
that

    f(x) ≤ M g(x) / (1 + ε);

that is, the bound M is not tight. If we assume E_f[h(X)] = 0, the estimator δ_4 of
(3.16) then satisfies

    var(δ_4) ≤ { E[(t − 1)/t²] (M − 1)/ε + E[t^{-2}] } E_f[h²(X)],

where

    E[t^{-2}] = ρ Li(1 − ρ) / (1 − ρ),

and Li(x) denotes the dilog function,

    Li(x) = ∑_{k=1}^{∞} x^k / k²

(see Abramowitz and Stegun 1964, formula 27.7, who also tabulated the function),
which can also be written as

    Li(1 − x) = ∫_x^1 log(u) / (u − 1) du

(see Problem 4.19). The bound on the variance of δ_4 is thus

    var(δ_4) ≤ { (ρ/(1 − ρ) − ε^{-1}) Li(1 − ρ) − ε^{-1} log(ρ) } E_f[h²(X)]

and δ_4 uniformly dominates the usual Accept-Reject estimator δ_1 of (3.15) as long
as

(4.20)    ε^{-1} { −log(ρ) − Li(1 − ρ) } + ρ Li(1 − ρ) / (1 − ρ) < 1.

This result rigorously establishes the advantage of recycling the rejected variables
for the computation of integrals, since (4.20) does not depend on the function h.
Note, however, that the assumption E_f[h(X)] = 0 is quite restrictive, because the
sum of the weights of δ_4 does not equal 1 and, therefore, δ_4 does not correctly
estimate constant functions (except for the constant function h = 0). Therefore, δ_1
will dominate δ_4 for constant functions, and a uniform comparison between the two
estimators is impossible.

Figure 4.12 gives the graphs of the left-hand side of (4.20) for ε = 0.1, 0.2, ...,
0.9. A surprising aspect of these graphs is that domination (that is, where the curve
is less than 1) occurs for larger values of ρ, which is somewhat counterintuitive,
since smaller values of ρ lead to higher rejection rates, therefore to larger rejected
subsamples and to a smaller variance of δ_4 for an adequate choice of density functions
g. On the other hand, the curves are correctly ordered in ε since larger values of ε
lead to wider domination zones.

Fig. 4.12. Graphs of the variance coefficients of δ_4 for ε = 0.1, ..., 0.9, the curves
decreasing in ε. The domination of δ_1 by δ_4 occurs when the variance coefficient is
less than 1.
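
As a rough numerical companion to Figure 4.12, the sketch below evaluates the
variance coefficient of δ_4. It assumes that the left-hand side of (4.20) is
(1/ε){−log(ρ) − Li(1 − ρ)} + ρ Li(1 − ρ)/(1 − ρ) with ρ = 1/M, which is what the
variance bound gives once E[1/t] and E[1/t²] are replaced by their values for a
geometric variable t. The truncated series for Li and the function names are
illustrative choices, not part of the original text.

```python
import math


def dilog(x, terms=20000):
    """Truncated series Li(x) = sum_{k>=1} x^k / k^2, for 0 <= x <= 1."""
    total, power = 0.0, 1.0
    for k in range(1, terms + 1):
        power *= x
        total += power / (k * k)
    return total


def variance_coefficient(rho, eps):
    """Assumed left-hand side of (4.20), with rho = 1/M and looseness eps."""
    li = dilog(1.0 - rho)
    return (-math.log(rho) - li) / eps + rho * li / (1.0 - rho)


# One row per value of eps; domination corresponds to entries below 1.
for eps in (0.1, 0.5, 0.9):
    row = [round(variance_coefficient(rho, eps), 3) for rho in (0.2, 0.5, 0.8, 0.95)]
    print(eps, row)
```

Under these assumptions the coefficient falls below 1 as ρ approaches 1 and
decreases in ε for fixed ρ, in line with the ordering of the curves described
above.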



4.6.3 Partitioning 

When enough information is available on the function f, stratified sampling may be
used. This technique (see Hammersley and Handscomb 1964 or Rubinstein 1981
for references) decomposes the integration domain X in a partition X_1, ..., X_p,
with separate evaluations of the integrals on each region X_i; that is, the integral
∫_X h(x) f(x) dx is expressed as

    ∫_X h(x) f(x) dx = ∑_{i=1}^{p} ∫_{X_i} h(x) f(x) dx
                     = ∑_{i=1}^{p} ϱ_i ∫_{X_i} h(x) f_i(x) dx = ∑_{i=1}^{p} ϱ_i ℑ_i,

where the weights ϱ_i are the probabilities of the regions X_i and the f_i's are the
restrictions of f to these regions.

Then samples of size n_i are generated from the f_i's to evaluate each integral ℑ_i
separately by a regular estimator \hat{ℑ}_i.

The motivation for this approach is that the variance of the resulting estimator,
ϱ_1 \hat{ℑ}_1 + ··· + ϱ_p \hat{ℑ}_p, that is,

    ∑_{i=1}^{p} ϱ_i² (1/n_i) ∫_{X_i} (h(x) − ℑ_i)² f_i(x) dx,

may be much smaller than the variance of the standard Monte Carlo estimator based
on a sample of size n = n_1 + ··· + n_p. The optimal choice of the n_i's in this respect
is such that

    (n_i*)² ∝ ϱ_i² ∫_{X_i} (h(x) − ℑ_i)² f_i(x) dx.

Thus, if the regions X_i can be chosen, the variance of the stratified estimator can
be reduced by selecting X_i's with similar variance factors ∫_{X_i} (h(x) − ℑ_i)² f_i(x) dx.

While this approach may seem far-fetched, because of its requirements on the
distribution f, note that it can be combined with importance sampling, where the
importance function g may be chosen in such a way that its quantiles are well known.
Also, it can be iterated, with each step producing an evaluation of the ϱ_i's and of
the corresponding variance factors, which may help in selecting the partition and
the n_i's in the next step (see, however, Problem 4.22).
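
To make the variance comparison concrete, here is a minimal sketch of stratified
sampling in a case where everything is known: f is the U(0, 1) density, the
regions are p equal-width intervals (so each weight is 1/p), and every stratum
receives the same sample size. The integrand h(x) = x², the helper names, and
the sample sizes are assumptions chosen only for illustration.

```python
import random
import statistics


def h(x):
    return x * x


def plain_mc(n, rng):
    """Standard Monte Carlo estimate of E[h(U)], U ~ U(0, 1), from n draws."""
    return sum(h(rng.random()) for _ in range(n)) / n


def stratified_mc(n_per_stratum, p, rng):
    """Weighted sum of per-stratum averages over [i/p, (i+1)/p), weights 1/p."""
    total = 0.0
    for i in range(p):
        # (i + U)/p is uniform on the i-th stratum, i.e. a draw from f restricted to it.
        draws = [(i + rng.random()) / p for _ in range(n_per_stratum)]
        total += (1.0 / p) * sum(h(x) for x in draws) / n_per_stratum
    return total


rng = random.Random(0)
p, n_i = 10, 20
plain = [plain_mc(p * n_i, rng) for _ in range(300)]
strat = [stratified_mc(n_i, p, rng) for _ in range(300)]
print(statistics.pvariance(plain), statistics.pvariance(strat))
```

Both estimators use n = 200 evaluations of h per replication; the printed
empirical variances allow a direct comparison of the two designs for this
particular integrand.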

An extension proposed by McKay et al. (1979) introduces stratification on all
input dimensions. More precisely, if the domain is represented as the unit hypercube
H in R^d and the integral (3.5) is written as ∫_H h(x) dx, Latin hypercube sampling
relies on the simulation of (a) d random permutations π_j of {1, ..., n} and (b)
uniform U([0, 1]) rv's U^j_{i_1,...,i_d} (j = 1, ..., d, 1 ≤ i_1, ..., i_d ≤ n). A sample of n
vectors X_i in the unit hypercube is then produced as X(π_1(i), ..., π_d(i)) with

    X(i_1, ..., i_d) = (X_1(i_1, ..., i_d), ..., X_d(i_1, ..., i_d)),

    X_j(i_1, ..., i_d) = (i_j − U^j_{i_1,...,i_d}) / n,    1 ≤ j ≤ d.

The component X_j(i_1, ..., i_d) is therefore a point taken at random on the interval
[(i_j − 1)/n, i_j/n] and the permutations ensure that no uniform variable is taken twice
from the same interval, for every dimension. Note that we also need only generate
n × d uniform random variables. (Latin hypercubes are also used in agricultural
experiments to ensure that all parcels and all varieties are used, at a minimal cost.
See Mead 1988 or Kuehl 1994.) McKay et al. (1979) show that when h is a
real-valued function, the variance of the resulting estimator is substantially reduced,
compared with the regular Monte Carlo estimator based on the same number of
samples. Asymptotic results about this technique can be found in Stein (1987) and
Loh (1996). (It is, however, quite likely that the curse of dimensionality, see Section
4.3, occurs for this technique.)
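
The construction can be sketched in a few lines. The code below assumes the
component formula X_j = (i_j − U)/n with one independent permutation per
dimension, so that only n × d uniforms are drawn; the function name and the
small values of n and d are illustrative.

```python
import random


def latin_hypercube(n, d, rng):
    """Return n points of [0, 1]^d such that, in every dimension, each interval
    [(k-1)/n, k/n], k = 1, ..., n, contains exactly one coordinate."""
    perms = []
    for _ in range(d):
        perm = list(range(1, n + 1))
        rng.shuffle(perm)          # random permutation pi_j of {1, ..., n}
        perms.append(perm)
    return [
        tuple((perms[j][i] - rng.random()) / n for j in range(d))
        for i in range(n)
    ]


rng = random.Random(1)
points = latin_hypercube(8, 3, rng)
for pt in points:
    print(tuple(round(c, 3) for c in pt))
```

Averaging h over the returned points gives the Latin hypercube estimate of
∫_H h(x) dx; the stratification property can be checked by sorting the
coordinates of any one dimension.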




5 



Monte Carlo Optimization 



“Remember, boy,” Sam Nakai would sometimes tell Chee, “when you’re 
tired of walking up a long hill you think about how easy it’s going to be 
walking down.” 

— Tony Hillerman, A Thief of Time 

This chapter is the equivalent for optimization problems of what Chapter 3 
is for integration problems. Here we distinguish between two separate uses of 
computer generated random variables. The first use, as seen in Section 5.2, is 
to produce stochastic techniques to reach the maximum (or minimum) of a 
function, devising random exploration techniques on the surface of this function
that avoid being trapped in a local maximum (or minimum) but that
are also sufficiently attracted by the global maximum (or minimum). The second
use, described in Section 5.3, is closer to Chapter 3 in that it approximates 
the function to be optimized. The most popular algorithm in this perspective 
is the EM (Expectation-Maximization) algorithm. 



5.1 Introduction 

Similar to the problem of integration, differences between the numerical approach
and the simulation approach to the problem

(5.1)    max_{θ ∈ Θ} h(θ)

lie in the treatment of the function^ h. (Note that (5.1) also covers minimization
problems by considering −h.) In approaching an optimization problem
using deterministic numerical methods, the analytical properties of the target
function (convexity, boundedness, smoothness) are often paramount. For
the simulation approach, we are more concerned with h from a probabilistic
(rather than analytical) point of view. Obviously, this dichotomy is somewhat
artificial, as there exist simulation approaches where the probabilistic
interpretation of h is not used. Nonetheless, the use of the analytical properties of
h plays a lesser role in the simulation approach.

^ Although we use θ as the running parameter and h typically corresponds to a
(possibly penalized) transform of the likelihood function, this setup applies to
inferential problems other than likelihood or posterior maximization. As noted in
the introduction to Chapter 3, problems concerned with complex loss functions
or confidence regions also require optimization procedures.

Numerical methods enjoy a longer history than simulation methods (see, 
for instance, Kennedy and Gentle 1980 or Thisted 1988), but simulation meth- 
ods have gained in appeal due to the relaxation of constraints both on the 
regularity of the domain Θ and on the function h. Of course, there may exist
an alternative numerical approach which provides an exact solution to (5.1), 
a property rarely achieved by a stochastic algorithm, but simulation has the 
advantage of bypassing the preliminary steps of devising an algorithm and 
studying whether some regularity conditions on h hold. This is particularly 
true when the function h is very costly to compute. 

Example 5.1. Signal processing. Ó Ruanaidh and Fitzgerald (1996) study
signal processing data, of which a simple model is (i = 1, ..., N)

    x_i = α_1 cos(ω t_i) + α_2 sin(ω t_i) + ε_i,    ε_i ~ N(0, σ²),

with unknown parameters α = (α_1, α_2), ω, and σ and observation times t_1, ...,
t_N. The likelihood function is then of the form

    σ^{-N} exp( −(x − G α^t)^t (x − G α^t) / (2σ²) ),

with x = (x_1, ..., x_N) and

        ( cos(ω t_1)   sin(ω t_1) )
    G = (    ...          ...     )
        ( cos(ω t_N)   sin(ω t_N) ).

The prior π(α, ω, σ) = σ^{-1} leads to the marginal distribution

(5.2)    π(ω|x) ∝ (x^t x − x^t G (G^t G)^{-1} G^t x)^{(2−N)/2} (det G^t G)^{-1/2},

which, although explicit in ω, is not particularly simple to compute. This setup
is also illustrative of functions with many modes, as shown by Ó Ruanaidh
and Fitzgerald (1996). ||
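
To give a feel for the shape of this marginal, the following sketch evaluates
its logarithm on a grid of frequencies for simulated data. It assumes the form
π(ω|x) ∝ RSS(ω)^{(2−N)/2} (det G^t G)^{-1/2}, where RSS(ω) is the residual sum
of squares x^t x − x^t G (G^t G)^{-1} G^t x; the simulated signal (true frequency
0.7, integer observation times), the grid, and all names are illustrative
assumptions.

```python
import math
import random


def log_marginal(omega, x, t):
    """Log of the assumed marginal of omega, up to an additive constant."""
    c = [math.cos(omega * ti) for ti in t]
    s = [math.sin(omega * ti) for ti in t]
    # Entries of the 2x2 matrix G^t G and of the vector G^t x.
    a = sum(ci * ci for ci in c)
    b = sum(ci * si for ci, si in zip(c, s))
    d = sum(si * si for si in s)
    u = sum(ci * xi for ci, xi in zip(c, x))
    v = sum(si * xi for si, xi in zip(s, x))
    det = a * d - b * b
    # x^t G (G^t G)^{-1} G^t x, using the explicit inverse of a 2x2 matrix.
    fit = (d * u * u - 2.0 * b * u * v + a * v * v) / det
    rss = sum(xi * xi for xi in x) - fit
    n = len(x)
    return 0.5 * (2 - n) * math.log(rss) - 0.5 * math.log(det)


rng = random.Random(0)
t = list(range(1, 51))
x = [math.cos(0.7 * ti) + 0.5 * math.sin(0.7 * ti) + rng.gauss(0.0, 0.5) for ti in t]

grid = [0.05 + 0.005 * k for k in range(600)]
values = [log_marginal(w, x, t) for w in grid]
omega_hat = grid[values.index(max(values))]
n_modes = sum(
    1 for k in range(1, len(grid) - 1)
    if values[k] > values[k - 1] and values[k] > values[k + 1]
)
print(omega_hat, n_modes)
```

The count of local maxima on the grid gives a crude indication of how many
modes a local optimizer could be attracted to.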



Following Geyer (1996), we want to distinguish between two approaches 
to Monte Carlo optimization. The first is an exploratory approach, in which 
the goal is to optimize the function h by describing its entire range. The 
actual properties of the function play a lesser role here, with the Monte Carlo 
aspect more closely tied to the exploration of the entire space Θ, even though,
for instance, the slope of h can be used to speed up the exploration. (Such
a technique can be useful in describing functions with multiple modes, for
example.) The second approach is based on a probabilistic approximation
of the objective function h and is somewhat of a preliminary step to the
actual optimization. Here, the Monte Carlo aspect exploits the probabilistic
properties of the function h to come up with an acceptable approximation and
is less concerned with exploring Θ. We will see that this approach can be tied
to missing data methods, such as the EM algorithm. We note also that Geyer
(1996) only considers the second approach to be “Monte Carlo optimization.” 
Obviously, even though we are considering these two different approaches 
separately, they might be combined in a given problem. In fact, methods like 
the EM algorithm (Section 5.3.2) or the Robbins-Monro algorithm (Section 
5.5.3) take advantage of the Monte Carlo approximation to enhance their 
particular optimization technique. 



5.2 Stochastic Exploration 

5.2.1 A Basic Solution 

There are a number of cases where the exploration method is particularly well 
suited. First, if Θ is bounded, which may sometimes be achieved by a
reparameterization, a first approach to the resolution of (5.1) is to simulate from
a uniform distribution on Θ, u_1, ..., u_m ~ U_Θ, and to use the approximation
h*_m = max(h(u_1), ..., h(u_m)). This method converges (as m goes to ∞), but
it may be very slow since it does not take into account any specific feature of
h. Distributions other than the uniform, which can possibly be related to h,
may then do better. In particular, in setups where the likelihood function is
extremely costly to compute, the number of evaluations of the function h is 
best kept to a minimum. 

Example 5.2. A first Monte Carlo maximization. Recall the function
that we looked at in Example 3.4, h(x) = [cos(50x) + sin(20x)]². Since
the function is defined on a bounded interval, we try our naive strategy
and simulate u_1, ..., u_m ~ U(0, 1), and use the approximation
h*_m = max(h(u_1), ..., h(u_m)). The results are shown in Figure 5.1. There we see
that the random search has done a fine job of mimicking the function. The
Monte Carlo maximum is 3.832, which agrees perfectly with the "true" maximum,
obtained by an exhaustive evaluation.

Of course, this is a small example, and as mentioned above, this naive
method can be costly in many situations. However, the example illustrates
the fact that in low-dimensional problems, if function evaluation is rapid, this
method is a reasonable choice. ||
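
The naive strategy of this example fits in a few lines. The sketch below draws
m = 5000 uniform points on (0, 1), matching the number quoted for Figure 5.1,
and keeps the one with the largest value of h; the seed and function names are
arbitrary illustrative choices.

```python
import math
import random


def h(x):
    return (math.cos(50 * x) + math.sin(20 * x)) ** 2


def random_search_max(h, m, rng):
    """Return (u, h(u)) for the best of m uniform draws on (0, 1)."""
    best_u = max((rng.random() for _ in range(m)), key=h)
    return best_u, h(best_u)


rng = random.Random(42)
u_star, h_star = random_search_max(h, 5000, rng)
print(u_star, h_star)
```

Since h is bounded by 4 and cheap to evaluate, the cost here is negligible; the
same code becomes impractical when each evaluation of h is expensive or the
dimension grows.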
Fig. 5.1. Calculation of the maximum of the function (1.26), given in Example 3.4.
The left panel is a graph of the function, and the right panel is a scatterplot of 5000
random U(0, 1) and the function evaluated at those points.



This leads to a second, and often more fruitful, direction, which relates h 
to a probability distribution. For instance, if h is positive and if 

    ∫_Θ h(θ) dθ < +∞,

the resolution of (5.1) amounts to finding the modes of the density h. More
generally, if these conditions are not satisfied, then we may be able to
transform the function h(θ) into another function H(θ) that satisfies the following:

(i) The function H is non-negative and satisfies ∫ H < ∞.

(ii) The solutions to (5.1) are those which maximize H(θ) on Θ.

For example, we can take

    H(θ) = exp(h(θ)/T)    or    H(θ) = exp{h(θ)/T} / (1 + exp{h(θ)/T})

and choose T to accelerate convergence or to avoid local maxima (as in simulated
annealing; see Section 5.2.3). When the problem is expressed in statistical
terms, it becomes natural to then generate a sample (θ_1, ..., θ_m) from h
(or H) and to apply a standard mode estimation method (or to simply compare
the h(θ_i)'s). (In some cases, it may be more useful to decompose h(θ)
into h(θ) = h_1(θ) h_2(θ) and to simulate from h_1.)



Example 5.3. Minimization of a complex function. Consider minimizing
the (artificially constructed) function in R²

    h(x, y) = (x sin(20y) + y sin(20x))² cosh(sin(10x) x)
              + (x cos(10y) − y sin(10x))² cosh(cos(20y) y),

whose global minimum is 0, attained at (x, y) = (0, 0). (This is the big brother
of the function in Example 5.2.) Since this function has many local minima, as
shown by Figure 5.2, it does not satisfy the conditions under which standard
minimization methods are guaranteed to provide the global minimum. On the
other hand, the distribution on R² with density proportional to exp(−h(x, y))
can be simulated, even though this is not a standard distribution, by using,
for instance, Markov chain Monte Carlo techniques (introduced in Chapters
7 and 8), and a convergent approximation of the minimum of h(x, y) can be
derived from the minimum of the resulting h(x_i, y_i)'s. An alternative is to
simulate from the density proportional to

    h_1(x, y) = exp{−(x sin(20y) + y sin(20x))² − (x cos(10y) − y sin(10x))²},

which eliminates the computation of both cosh and sinh in the simulation
step. ||

Fig. 5.2. Grid representation of the function h(x, y) of Example 5.3 on [−1, 1]².



Exploration may be particularly difficult when the space Θ is not convex
(or perhaps not even connected). In such cases, the simulation of a sample
(θ_1, ..., θ_m) may be much faster than a numerical method applied to (5.1).
The appeal of simulation is even clearer in the case when h can be represented
as

    h(θ) = ∫ H(x, θ) dx.

In particular, if H(x, θ) is a density and if it is possible to simulate from this
density, the solution of (5.1) is the mode of the marginal distribution of θ.
(Although this setting may appear contrived or even artificial, we will see in
Section 5.3.1 that it includes the case of missing data models.)

We now look at several methods to find maxima that can be classified as 
exploratory methods. 

5.2.2 Gradient Methods 

As mentioned in Section 1.4, the gradient method is a deterministic numerical 
approach to the problem (5.1). It produces a sequence (θ_j) that converges to
the exact solution of (5.1), θ*, when the domain Θ ⊂ R^d and the function
(−h) are both convex. The sequence (θ_j) is constructed in a recursive manner
through

(5.3)    θ_{j+1} = θ_j + α_j ∇h(θ_j),    α_j > 0,

where ∇h is the gradient of h. For various choices of the sequence (α_j) (see
Thisted 1988), the algorithm converges to the (unique) maximum.

In more general setups (that is, when the function or the space is less regular),
equation (5.3) can be modified by stochastic perturbations to again
achieve convergence, as described in detail in Rubinstein (1981) or Duflo
(1996, pp. 61-63). One of these stochastic modifications is to choose a second
sequence (β_j) to define the chain (θ_j) by

(5.4)    θ_{j+1} = θ_j + (α_j / (2β_j)) Δh(θ_j, β_j ζ_j) ζ_j.

The variables ζ_j are uniformly distributed on the unit sphere ||ζ|| = 1 and
Δh(x, y) = h(x + y) − h(x − y) approximates 2||y|| ∇h(x). In contrast to the
deterministic approach, this method does not necessarily proceed along the
steepest slope in θ_j, but this property is sometimes a plus in the sense that it
may avoid being trapped in local maxima or in saddlepoints of h.

The convergence of (θ_j) to the solution θ* again depends on the choice of
(α_j) and (β_j). We note in passing that (θ_j) can be seen as a nonhomogeneous
Markov chain (see Definition 6.4) which almost surely converges to a given
value. The study of these chains is quite complicated given their ever-changing
transition kernel (see Winkler 1995 for some results in this direction). However,
sufficiently strong conditions such as the decrease of α_j toward 0 and of
α_j / β_j to a nonzero constant are enough to guarantee the convergence of the
sequence (θ_j).
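
A minimal sketch of this stochastic perturbation, assuming the update
θ_{j+1} = θ_j + (α_j/(2β_j)) [h(θ_j + β_j ζ_j) − h(θ_j − β_j ζ_j)] ζ_j, is given
below. The direction ζ_j is obtained by normalizing a Gaussian vector, and the
scheme is tried on a simple concave quadratic rather than on the multimodal
function of Example 5.3; the test function, the step sequences, and all names
are illustrative assumptions.

```python
import math
import random


def stochastic_gradient(h, theta0, alpha, beta, n_iter, rng):
    """Maximize h by finite differences along random unit directions."""
    theta = list(theta0)
    d = len(theta)
    for j in range(1, n_iter + 1):
        # A normalized Gaussian vector is uniform on the unit sphere.
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(zi * zi for zi in z))
        zeta = [zi / norm for zi in z]
        a, b = alpha(j), beta(j)
        plus = [t + b * zi for t, zi in zip(theta, zeta)]
        minus = [t - b * zi for t, zi in zip(theta, zeta)]
        delta = h(plus) - h(minus)
        theta = [t + a / (2.0 * b) * delta * zi for t, zi in zip(theta, zeta)]
    return theta


def h_quad(theta):
    # Concave test function with its maximum at (1, -2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


rng = random.Random(3)
theta_hat = stochastic_gradient(
    h_quad, (3.0, 1.0),
    alpha=lambda j: 1.0 / (j + 1), beta=lambda j: 1.0 / (j + 1),
    n_iter=2000, rng=rng,
)
print(theta_hat)
```

To minimize a function, as in Example 5.3, the same routine can be applied to
its negative; the choice of the two sequences then drives which local optimum
is reached.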

Example 5.4. (Continuation of Example 5.3) We can apply the iterative
construction (5.4) to the multimodal function h(x, y) with different sequences
of α_j's and β_j's. Figure 5.3 and Table 5.1 illustrate that, depending on the
starting value, the algorithm converges to different local minima of the function
h. Although there are occurrences when the sequence h(θ_j) increases
and avoids some local minima, the solutions are quite distinct for the three
different sequences, both in location and values.

As shown by Table 5.1, the number of iterations needed to achieve stability
of θ_T also varies with the choice of (α_j, β_j). Note that Case 1 results in a very
poor evaluation of the minimum, as the fast decrease of (α_j) is associated with
big jumps in the first iterations. Case 2 converges to the closest local minimum,
and Case 3 illustrates a general feature of the stochastic gradient method,
namely that slower decrease rates of the sequence (α_j) tend to achieve better
minima. The final convergence along a valley of h after some initial big jumps
is also noteworthy. ||



α_j              β_j       θ_T                h(θ_T)          min_t h(θ_t)     Iteration T
1/10j            1/10j     (−0.166, 1.02)     1.287           0.115            50
1/100j           1/100j    (0.629, 0.786)     0.00013         0.00013          93
1/10 log(1+j)    1/j       (0.0004, 0.245)    4.24 × 10^-6    2.163 × 10^-7    58

Table 5.1. Results of three stochastic gradient runs for the minimization of the
function h in Example 5.3 with different values of (α_j, β_j) and starting point (0.65, 0.8).
The iteration T is obtained by the stopping rule ||θ_T − θ_{T−1}|| < 10^-5.



This approach is still quite close to numerical methods in that it requires a
precise knowledge of the function h, which may not necessarily be available.



5.2.3 Simulated Annealing 

The simulated annealing algorithm^ was introduced by Metropolis et al.
(1953) to minimize a criterion function on a finite set with very large size^,
but it also applies to optimization on a continuous set and to simulation (see
Kirkpatrick et al. 1983, Ackley et al. 1985, and Neal 1993, 1995).

The fundamental idea of simulated annealing methods is that a change
of scale, called temperature, allows for faster moves on the surface of the
function h to maximize, whose negative is called energy. Therefore, rescaling
partially avoids the trapping attraction of local maxima. Given a temperature
parameter T > 0, a sample θ_1^T, θ_2^T, ... is generated from the distribution

    π(θ) ∝ exp(h(θ)/T)

and can be used as in Section 5.2.1 to come up with an approximate maximum
of h. As T decreases toward 0, the values simulated from this distribution
become concentrated in a narrower and narrower neighborhood of the local
maxima of h (see Theorem 5.10 and Winkler 1995).

^ This name is borrowed from metallurgy: a metal manufactured by a slow decrease
of temperature (annealing) is stronger than a metal manufactured by a fast
decrease of temperature. There is also input from physics, as the function to be
minimized is called energy and the variance factor T, which controls convergence,
is called temperature. We will try to keep these idiosyncrasies to a minimal level,
but they are quite common in the literature.

^ This paper is also the originator of the Markov chain Monte Carlo methods
developed in the following chapters.

Fig. 5.3. Stochastic gradient paths for three different choices of the sequences
(α_j) and (β_j) and starting point (0.65, 0.8) for the same sequence (ζ_j) in (5.4). The
gray levels are such that darker shades mean higher elevations. The function h to
be minimized is defined in Example 5.3. [Panel: α_j = 1/10 log(1 + j), β_j = 1/j]

The fact that this approach has a moderating effect on the attraction of the
local maxima of h becomes more apparent when we consider the simulation
method proposed by Metropolis et al. (1953). Starting from θ_0, ζ is generated
from a uniform distribution on a neighborhood V(θ_0) of θ_0 or, more generally,
from a distribution with density g(|ζ − θ_0|), and the new value of θ is generated
as follows:

    θ_1 = ζ      with probability ρ = exp(Δh/T) ∧ 1,
    θ_1 = θ_0    with probability 1 − ρ,

where Δh = h(ζ) − h(θ_0). Therefore, if h(ζ) ≥ h(θ_0), ζ is accepted with
probability 1; that is, θ_0 is always changed into ζ. On the other hand, if
h(ζ) < h(θ_0), ζ may still be accepted with probability ρ ≠ 0 and θ_0 is then
changed into ζ. This property allows the algorithm to escape the attraction
of θ_0 if θ_0 is a local maximum of h, with a probability which depends on the
choice of the scale T, compared with the range of the density g. (This method
is in fact the Metropolis algorithm, which simulates the density proportional to
exp{h(θ)/T}, as the limiting distribution of the chain θ_0, θ_1, ..., as described
and justified in Chapter 7.)

In its most usual implementation, the simulated annealing algorithm modifies
the temperature T at each iteration; it is then of the form

Algorithm A.19 -Simulated Annealing-

1. Simulate ζ from an instrumental distribution
   with density g(|ζ − θ_i|);

2. Accept θ_{i+1} = ζ with probability

       ρ_i = exp{Δh_i/T_i} ∧ 1;                    [A.19]

   take θ_{i+1} = θ_i otherwise.

3. Update T_i to T_{i+1}.

Example 5.5. A first simulated annealing maximization. We again
look at the function from Example 5.2,

    h(x) = [cos(50x) + sin(20x)]²,

and apply a simulated annealing algorithm to find the maximum. The specific
algorithm we use is

At iteration t the algorithm is at (x^(t), h(x^(t))):

1. Simulate u ~ U(a_t, b_t) where a_t = max(x^(t) − r, 0) and b_t = min(x^(t) + r, 1).

2. Accept x^(t+1) = u with probability

       ρ^(t) = min{ exp( (h(u) − h(x^(t))) / T_t ), 1 };

   take x^(t+1) = x^(t) otherwise.

3. Update T_t to T_{t+1}.

For r = .5 and T_t = 1/log(t), the results of the algorithm are shown in
Figure 5.4. The four panels show different trajectories of the points (x^(t), h(x^(t))).
It is interesting to see how the path moves toward the maximum fairly rapidly,
and then remains there, oscillating between the two maxima (remember that
h is symmetric around 1/2).

The value of r controls the size of the interval around the current point (we
truncate to stay in (0, 1)) and the value of T_t controls the cooling. For different
values of r and T the path will display different properties. See Problem 5.4
and the more complex Example 5.9. ||

Fig. 5.4. Calculation of the maximum of the function (1.26), given in Example 5.5.
The four different panels show the trajectory of 2500 pairs (x^(t), h(x^(t))) for each of
four different runs.
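
The three steps of this example translate directly into code. The sketch below
follows them with r = 0.5 and 2500 iterations, with one stated departure: the
temperature is taken as T_t = 1/log(1 + t) rather than 1/log(t), so that it is
finite at t = 1. The tracking of the best visited point, the seed, and the
names are illustrative additions.

```python
import math
import random


def h(x):
    return (math.cos(50 * x) + math.sin(20 * x)) ** 2


def simulated_annealing_1d(h, x0, r, n_iter, rng):
    """Return the final state and the best state visited on (0, 1)."""
    x, best_x = x0, x0
    for t in range(1, n_iter + 1):
        temp = 1.0 / math.log(1.0 + t)        # shifted schedule, finite at t = 1
        a, b = max(x - r, 0.0), min(x + r, 1.0)
        u = rng.uniform(a, b)                 # proposal truncated to stay in (0, 1)
        # Accept with probability min(exp((h(u) - h(x)) / T_t), 1).
        if h(u) >= h(x) or rng.random() < math.exp((h(u) - h(x)) / temp):
            x = u
        if h(x) > h(best_x):
            best_x = x
    return x, best_x


rng = random.Random(7)
x_last, x_best = simulated_annealing_1d(h, x0=0.5, r=0.5, n_iter=2500, rng=rng)
print(x_last, h(x_last), x_best, h(x_best))
```

Smaller values of r make the moves more local, and a faster decrease of the
temperature makes the chain settle earlier, possibly on a lower peak.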

An important feature of the simulated annealing algorithm is that there 
exist convergence results in the case of finite spaces, as Theorem 5.7 below, 
which was proposed by Hajek (1988). (See Winkler 1995, for extensions.) 
Consider the following notions, which are used to impose restrictions on the
decrease rate of the temperature:

Definition 5.6. Given a finite state-space E and a function h to be maximized:

(i) a state e_j ∈ E can be reached at altitude h̄ from state e_i ∈ E if there exists
a sequence of states e_1, ..., e_n linking e_i and e_j, such that h(e_k) ≥ h̄ for
k = 1, ..., n;

(ii) the height of a maximum e_i is the largest value d_i such that there exists a
state e_j such that h(e_j) > h(e_i) which can be reached at altitude h(e_i) + d_i
from e_i.

Thus, h(e_i) + d_i is the altitude of the highest pass linking e_i and e_j through
an optimal sequence. (In particular, h(e_i) + d_i can be larger than the altitude
of the closest pass relating e_i and e_j.) By convention, we take d_i = −∞ if e_i
is a global maximum. If O denotes the set of local maxima of E and \underline{O} is the
subset of O of global maxima, Hajek (1988) establishes the following result:

Theorem 5.7. Consider a system in which it is possible to link two arbitrary
states by a finite sequence of states. If, for every h̄ > 0 and every pair (e_i, e_j),
e_i can be reached at altitude h̄ from e_j if and only if e_j can be reached at
altitude h̄ from e_i, and if (T_i) decreases toward 0, the sequence (θ_i) defined by
Algorithm [A.19] satisfies

    lim_{i→∞} P(θ_i ∈ \underline{O}) = 1

if and only if

    ∑_{i=1}^{∞} exp(−D/T_i) = +∞,

with D = min{d_i : e_i ∈ O − \underline{O}}.

This theorem therefore gives a necessary and sufficient condition, on the
rate of decrease of the temperature, so that the simulated annealing algorithm
converges to the set of global maxima. This remains a relatively formal result
since D is, in practice, unknown. For example, if T_i = Γ/log i, there is
convergence to a global maximum if and only if Γ ≥ D. Numerous papers
and books have considered the practical determination of the sequence (T_i)
(see Geman and Geman 1984, Mitra et al. 1986, Van Laarhoven and Aarts
1987, Aarts and Kors 1989, Winkler 1995, and references therein). Instead of
the above logarithmic rate, a geometric rate, T_i = α^i T_0 (0 < α < 1), is also
often adopted in practice, with the constant α calibrated at the beginning of
the algorithm so that the acceptance rate is high enough in the Metropolis
algorithm (see Section 7.6).
The fact that approximate methods are necessary for optimization problems
in finite state-spaces may sound rather artificial and unnecessary, but the
spaces involved in some modeling can be huge. For instance, a black-and-white
TV image with 256 × 256 pixels corresponds to a state-space with cardinality
2^{256×256} ≈ 10^{20,000}. Similarly, the analysis of DNA sequences may involve 600
thousand bases (A, C, G, or T), which corresponds to state-spaces of size
4^{600,000} (see Churchill 1989, 1995).

Example 5.8. Ising model. The Ising model can be applied in electromagnetism
(Cipra 1987) and in image processing (Geman and Geman 1984). It
models two-dimensional tables s, of size D × D, where each term of s takes the
value +1 or −1. The distribution of the entire table is related to the (so-called
energy) function

(5.5)    h(s) = −J ∑_{(i,j)∈N} s_i s_j − H ∑_i s_i,

where i denotes the index of a term of the table and N is an equivalence
neighborhood relation, for instance, when i and j are neighbors either vertically
or horizontally. (The scale factors J and H are supposedly known.) The
model (5.5) is a particular case of models used in spatial statistics (Cressie
1993) to describe multidimensional correlated structures.

Note that the conditional representation of (5.5) is equivalent to a logit
model on s̃_i = (s_i + 1)/2,

(5.6)    P(S̃_i = 1 | s_j, j ≠ i) = e^g / (1 + e^g),

with g = g(s_j) = 2(H + J ∑_j s_j), the sum being taken on the neighbors of
i. For known parameters H and J, the inferential question may be to obtain
the most likely configuration of the system; that is, the minimum of h(s).
The implementation of the Metropolis et al. (1953) approach in this setup,
starting from an initial value s^(0), is to modify the sites of the table s one at a
time using the conditional distributions (5.6), with probability exp(−Δh/T),
ending up with a modified table s^(1), and to iterate this method by decreasing
the temperature T at each step. The reader can consult Swendsen and Wang
(1987) and Swendsen et al. (1992) for their derivation of efficient simulation
algorithms in these models and accelerating methods for the Gibbs sampler
(see Problem 7.43). ||
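
As an illustration of annealing on such a table, the sketch below minimizes the
energy h(s) = −J ∑ s_i s_j − H ∑ s_i (sum over horizontal and vertical neighbor
pairs) on a small grid. It uses single-site Metropolis flips accepted with
probability min{1, exp(−Δh/T)} rather than draws from (5.6), free boundaries,
and a temperature 4/log(1 + sweep); these choices, the grid size, and the names
are assumptions made for the example.

```python
import math
import random


def energy(s, J, H):
    """Energy -J * sum over neighbor pairs s_i s_j - H * sum_i s_i."""
    D = len(s)
    pair = 0
    for i in range(D):
        for j in range(D):
            if i + 1 < D:
                pair += s[i][j] * s[i + 1][j]
            if j + 1 < D:
                pair += s[i][j] * s[i][j + 1]
    return -J * pair - H * sum(map(sum, s))


def anneal_ising(D, J, H, n_sweeps, rng):
    """Simulated annealing with single-site flips on a D x D table of +1/-1."""
    s = [[rng.choice((-1, 1)) for _ in range(D)] for _ in range(D)]
    for sweep in range(1, n_sweeps + 1):
        temp = 4.0 / math.log(1.0 + sweep)
        for i in range(D):
            for j in range(D):
                nb = sum(
                    s[a][b]
                    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= a < D and 0 <= b < D
                )
                # Energy change if site (i, j) is flipped.
                delta = 2.0 * s[i][j] * (J * nb + H)
                if delta <= 0 or rng.random() < math.exp(-delta / temp):
                    s[i][j] = -s[i][j]
    return s


rng = random.Random(5)
table = anneal_ising(D=6, J=1.0, H=1.0, n_sweeps=300, rng=rng)
print(energy(table, 1.0, 1.0))
```

With J > 0 and H > 0 the minimum energy configuration is the table of all +1,
which gives a reference value against which the printed energy can be compared.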

Duflo (1996, pp. 264-271) also proposed an extension of these simulated
annealing methods to the general (continuous) case. Andrieu and Doucet
(2000) give a detailed proof of convergence of the simulated annealing
algorithm, as well as sufficient conditions on the cooling schedule, in the setup
of hidden Markov models (see Section 14.6.3). Their proof, which is beyond
our scope, is based on the developments of Haario and Saksman (1991).
Case    T_i                θ_T                 h(θ_T)    min_t h(θ_t)     Accept. rate
1       1/10i              (−1.94, −0.480)     0.198     4.02 × 10^-7     0.9998
2       1/log(1+i)         (−1.99, −0.133)     3.408     3.823 × 10^-7    0.96
3       100/log(1+i)       (−0.575, 0.430)     0.0017    4.708 × 10^-9    0.6888
4       1/10 log(1+i)      (0.121, −0.150)     0.0359    2.382 × 10^-7    0.71

Table 5.2. Results of simulated annealing runs for different values of T_i and starting
point (0.5, 0.4).



Example 5.9. (Continuation of Example 5.3) We can apply the algorithm
[A.19] to find a local minimum of the function h of Example 5.3, or
equivalently a maximum of the function exp(−h(x, y)/T_i). We choose a uniform
distribution on [−0.1, 0.1] for g, and different rates of decrease of the
temperature sequence (T_i). As illustrated by Figure 5.5 and Table 5.2, the
results change with the rate of decrease of the temperature T_i. Case 3 leads to
a very interesting exploration of the valleys of h on both sides of the central
zone. Since the theory (Duflo 1996) states that rates of the form Γ/log(i + 1)
are satisfactory for Γ large enough, this shows that Γ = 100 should be
acceptable. Note also the behavior of the acceptance rate in Table 5.2 for Step
2 in algorithm [A.19]. This is indicative of a rule we will discuss further in
Chapter 7 with Metropolis-Hastings algorithms, namely that superior
performances are not always associated with higher acceptance rates. ||



5.2.4 Prior Feedback 

Another approach to the maximization problem (5.1) is based on the re- 
sult of Hwang (1980) of convergence (in $T$) of the so-called Gibbs measure 
$\exp(h(\theta)/T)$ (see Section 5.5.3) to the uniform distribution on the set of global 
maxima of $h$. This approach, called recursive integration or prior feedback in 
Robert (1993) (see also Robert and Soubiran 1993), is based on the following 
convergence result. 

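Theorem 5.10. Consider $h$ a real-valued function defined on a closed and 
bounded set, $\Theta$, of $\mathbb{R}^p$. If there exists a unique solution $\theta^*$ satisfying 

$\theta^* = \arg\max_{\theta \in \Theta} h(\theta)$ , 

then 

$\lim_{\lambda \to \infty} \frac{\int_\Theta \theta \, e^{\lambda h(\theta)} \, d\theta}{\int_\Theta e^{\lambda h(\theta)} \, d\theta} = \theta^*$ , 

provided $h$ is continuous at $\theta^*$. 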



See Problem 5.6 for a proof. More details can be found in Pincus (1968) 
(see also Robert 1993 for the case of exponential families and Duflo 1996, pp. 
244-245, for a sketch of a proof). A related result can be found in D’Epifanio 
(1989, 1996). A direct corollary to Theorem 5.10 then justifies the recursive 
integration method which results in a Bayesian approach to maximizing the 
log-likelihood, $\ell(\theta|x)$. 

Fig. 5.5. Simulated annealing sequence of 5000 points for three different choices of 
the temperature $T_i$ in [A.19] and starting point (0.5, 0.4), aimed at minimizing the 
function $h$ of Example 5.3. (Panel title: (2) $T_i = 1/\log(i + 1)$.) 

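Corollary 5.11. Let $\pi$ be a positive density on $\Theta$. If there exists a unique 
maximum likelihood estimator $\theta^*$, it satisfies 

$\lim_{\lambda \to \infty} \frac{\int \theta \, e^{\lambda \ell(\theta|x)} \pi(\theta) \, d\theta}{\int e^{\lambda \ell(\theta|x)} \pi(\theta) \, d\theta} = \theta^*$ . 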

This result uses the same technique as in Theorem 5.10, namely the 
Laplace approximation of the numerator and denominator integrals (see also 
Tierney et al. 1989). It mainly expresses the fact that the maximum likeli- 
hood estimator can be written as a limit of Bayes estimators associated with 
an arbitrary distribution $\pi$ and with virtual observations corresponding to the 
$\lambda$th power of the likelihood, $\exp\{\lambda \ell(\theta|x)\}$. When $\lambda \in \mathbb{N}$, 

$\delta^\pi_\lambda(x) = \frac{\int \theta \, e^{\lambda \ell(\theta|x)} \pi(\theta) \, d\theta}{\int e^{\lambda \ell(\theta|x)} \pi(\theta) \, d\theta}$ 

is simply the Bayes estimator associated with the prior distribution $\pi$ and a 
corresponding sample which consists of $\lambda$ replications of the initial sample $x$. 
The intuition behind these results is that as the size of the sample goes to 
infinity, the influence of the prior distribution vanishes and the distribution 
associated with $\exp(\lambda \ell(\theta|x)) \pi(\theta)$ gets more and more concentrated around 
the global maxima of $\ell(\theta|x)$ when $\lambda$ increases (see, e.g., Schervish 1995). 

From a practical point of view, the recursive integration method can be 
implemented by computing the Bayes estimators $\delta^\pi_{\lambda_i}(x)$ for $i = 1, 2, \ldots$ until 
they stabilize. 

Maximizing the likelihood by this method is only of interest 
when more standard methods like the ones above are difficult or impossible 
to implement and the computation of Bayes estimators is straightforward. 
(Chapters 7 and 9 show that this second condition is actually very mild.) It 
is, indeed, necessary to compute the Bayes estimators $\delta^\pi_\lambda(x)$ corresponding to 
a sequence of $\lambda$’s until they stabilize. Note that when iterative algorithms are 
used to compute $\delta^\pi_\lambda(x)$, the previous solution (in $\lambda$) of $\delta^\pi_\lambda(x)$ can serve as the 
new initial value for the computation of $\delta^\pi_\lambda(x)$ for a larger value of $\lambda$. This 
feature increases the analogy with simulated annealing. The differences with 
simulated annealing are: 

(i) for a fixed temperature ($1/\lambda$), the algorithm converges to a fixed value, 
$\delta^\pi_\lambda$; 

(ii) a continuous decrease of $1/\lambda$ is statistically meaningless; 

(iii) the speed of convergence of $\lambda$ to $+\infty$ does not formally matter^ for the 
convergence of $\delta^\pi_\lambda(x)$ to $\theta^*$; 

(iv) the statistical motivation of this method is obviously stronger, in partic- 
ular because of the meaning of the parameter $\lambda$; 

(v) the only analytical constraint on $\ell(\theta|x)$ is the existence of a global maxi- 
mum, $\theta^*$ (see Robert and Titterington 1998 for extensions). 

λ                 5      10     100    1000   5000   10^{?}
δ^π_λ             2.02   2.04   1.89   1.98   1.94   2.00

Table 5.3. Sequence of Bayes estimators $\delta^\pi_\lambda$ for the estimation of $\alpha$ when $X \sim 
\mathcal{G}(\alpha, 1)$ and $x = 1.5$. 

Example 5.12. Gamma shape estimation. Consider the estimation of 
the shape parameter, $\alpha$, of a $\mathcal{G}(\alpha, \beta)$ distribution with $\beta$ known. Without loss 
of generality, take $\beta = 1$. For a constant (improper) prior distribution on $\alpha$, 
the posterior distribution satisfies 

$\pi_\lambda(\alpha|x) \propto x^{\lambda(\alpha - 1)} e^{-\lambda x} \Gamma(\alpha)^{-\lambda}$ . 

For a fixed $\lambda$, the computation of $E[\alpha|x, \lambda]$ can be obtained by simulation with 
the Metropolis-Hastings algorithm (see Chapter 7 for details) based on the 
instrumental distribution $\mathcal{E}xp(1/\alpha^{(n-1)})$, where $\alpha^{(n-1)}$ denotes the previous 
value of the associated Markov chain. Table 5.3 presents the evolution of 
$\delta^\pi_\lambda(x) = E[\alpha|x, \lambda]$ against $\lambda$, for $x = 1.5$. 

An analytical verification (using a numerical package like Mathematica) 
shows that the maximum of $x^{\alpha - 1}/\Gamma(\alpha)$ is, indeed, close to 2.0 for $x = 1.5$. || 
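
The behavior of the Bayes estimators in this example can also be checked without simulation. The following Python sketch is an illustration of ours, not the Metropolis-Hastings implementation described above: it replaces the Markov chain by a plain Riemann sum of the posterior $\pi_\lambda(\alpha|x)$ on a grid, and the grid bounds and step size are assumptions chosen for convenience.

```python
import math


def bayes_estimate(lam, x=1.5, upper=20.0, step=0.001):
    """Approximate E[alpha | x, lambda] for the posterior proportional to
    x^{lam (alpha - 1)} Gamma(alpha)^{-lam}, by a Riemann sum on (0, upper]."""
    n = int(round(upper / step))
    grid = [step * (k + 1) for k in range(n)]
    # Log posterior up to a constant (the factor e^{-lam x} does not depend on alpha).
    logp = [lam * ((a - 1.0) * math.log(x) - math.lgamma(a)) for a in grid]
    top = max(logp)  # subtract the maximum to avoid underflow for large lam
    w = [math.exp(v - top) for v in logp]
    return sum(a * wi for a, wi in zip(grid, w)) / sum(w)


if __name__ == "__main__":
    for lam in (5, 10, 100, 1000, 5000):
        print(lam, round(bayes_estimate(lam), 3))
```

As $\lambda$ grows, the weights concentrate around the maximizer of $x^{\alpha-1}/\Gamma(\alpha)$, so the printed values should settle near that maximizer, free of the Monte Carlo variability visible in Table 5.3.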

The appeal of recursive integration is also clear in the case of constrained 
parameter estimation. 

Example 5.13. Isotonic regression. Consider a table of normal observa- 
tions $X_{ij} \sim \mathcal{N}(\theta_{ij}, 1)$ with means that satisfy 

$\theta_{i-1,j} \vee \theta_{i,j-1} \le \theta_{ij} \le \theta_{i+1,j} \wedge \theta_{i,j+1}$ . 

Dykstra and Robertson (1982) have developed an efficient deterministic algo- 
rithm which maximizes the likelihood under these restrictions (see Problems 
1.18 and 1.19). However, a direct application of recursive integration also pro- 
vides the maximum likelihood estimator of $\theta = (\theta_{ij})$, requiring neither an 
extensive theoretical study nor high programming skills. 

^ However, we note that if A increases too quickly, the performance is affected in 
that there may be convergence to a local mode (see Robert and Titterington 1998 
for an illustration). 
Table 5.4 presents the data of Robertson et al. (1988), which relate the 
grades at the end of the first year to two entrance exams at the University of 
Iowa. Although these values are bounded, it is possible to use a normal model 
if the function to minimize is the least squares criterion (1.8), as already 
pointed out in Example 1.5. Table 5.5 provides the solution obtained by re- 
cursive integration in Robert and Hwang (1996); it coincides with the result 
of Robertson et al. (1988). || 



HSR \ ACT  1-12       13-15      16-18      19-21      22-24       25-27       28-30       31-33      34-36
91-99      1.57 (4)   2.11 (5)   2.73 (18)  2.96 (39)  2.97 (126)  3.13 (219)  3.41 (232)  3.45 (47)  3.51 (4)
81-90      1.80 (6)   1.94 (15)  2.52 (?)   2.68 (65)  2.69 (117)  2.82 (143)  2.75 (70)   2.74 (8)   -    (0)
71-80      1.88 (10)  2.32 (13)  2.32 (51)  2.53 (83)  2.58 (?)    2.55 (107)  2.72 (?)    2.76 (?)   -    (0)
61-70      2.11 (6)   2.23 (32)  2.29 (59)  2.29 (84)  2.50 (75)   2.42 (44)   2.41 (19)   -    (0)   -    (0)
51-60      1.60 (11)  2.06 (?)   2.12 (?)   2.11 (63)  2.31 (57)   2.10 (40)   1.58 (4)    2.13 (1)   -    (0)
41-50      1.75 (6)   1.98 (?)   2.05 (31)  2.16 (42)  2.35 (?)    2.48 (21)   1.36 (?)    -    (0)   -    (0)
31-40      1.92 (7)   1.84 (6)   2.15 (5)   1.95 (27)  2.02 (13)   2.10 (13)   1.49 (?)    -    (0)   -    (0)
21-30      1.62 (?)   2.26 (2)   1.91 (5)   1.86 (14)  1.88 (?)    3.78 (1)    1.40 (?)    -    (0)   -    (0)
00-20      1.38 (?)   1.57 (2)   2.49 (5)   2.01 (7)   2.07 (7)    -    (0)    0.75 (?)    -    (0)   -    (0)

Table 5.4. Average grades of first-year students at the University of Iowa given 
their rank at the end of high school (HSR) and at the ACT exam. Numbers in 
parentheses indicate the number of students in each category. (Source: Robertson 
et al. 1988.) 



HSR \ ACT  1-12   13-15  16-18  19-21  22-24  25-27  28-30  31-33  34-36
91-99      1.87   2.18   2.73   2.96   2.97   3.13   3.41   3.45   3.51
81-89      1.87   2.17   2.52   2.68   2.69   2.79   2.79   2.80   -
71-79      1.86   2.17   2.32   2.53   2.56   2.57   2.72   2.76   -
61-69      1.86   2.17   2.29   2.29   2.46   2.46   2.47   -      -
51-59      1.74   2.06   2.12   2.13   2.24   2.24   2.24   2.27   -
41-49      1.74   1.98   2.05   2.13   2.24   2.24   2.24   -      -
31-39      1.74   1.94   1.99   1.99   2.02   2.06   2.06   -      -
21-29      1.62   1.93   1.97   1.97   1.98   2.05   2.06   -      -
00-20      1.38   1.57   1.97   1.97   1.97   -      1.97   -      -

Table 5.5. Maximum likelihood estimates of the mean grades under lexicographical 
constraint. (Source: Robert and Hwang 1996.) 
5.3 Stochastic Approximation 



We next turn to methods that work more directly with the objective function 
rather than being concerned with fast explorations of the space. Informally 
speaking, these methods are somewhat preliminary to the true optimization 
step, in the sense that they utilize approximations of the objective function 
h. We note that these approximations have a different purpose than those we 
have previously encountered (for example, Laplace and saddlepoint approxi- 
mations in Section 3.4 and Section 3.6.2). In particular, the methods described 
here may sometimes result in an additional level of error by looking at the 
maximum of an approximation to h. 

Since most of these approximation methods only work in so-called miss- 
ing data models, we start this section with a brief introduction to these 
models. We return to the assumption that the objective function $h$ satis- 
fies $h(x) = E[H(x, Z)]$ and (as promised) show that this assumption arises in 
many realistic setups. Moreover, note that artificial extensions (or demarginal- 
ization), which use this representation, are only computational devices and do 
not invalidate the overall inference. 



5.3.1 Missing Data Models and Demarginalization 

In the previous chapters, we have already met structures where some missing 
(or latent) element greatly complicates the observed model. Examples include 
the obvious censored data models (Example 1.1), mixture models (Example 
1.2), where we do not observe the indicator of the component generating the 
observation, or logistic regression (Example 1.13), where the observation $Y_i$ 
can be interpreted as an indicator that a continuous variable with logistic 
distribution is less than $x_i^t \beta$. 

Missing data models are best thought of as models where the likelihood 
can be expressed as 

(5.7) $g(x|\theta) = \int_{\mathcal{Z}} f(x, z|\theta) \, dz$ 

or, more generally, where the function $h(x)$ to be optimized can be expressed 
as the expectation 

(5.8) $h(x) = E[H(x, Z)]$ . 

This assumption is relevant, and useful, in the setup of censoring models: 

Example 5.14. Censored data likelihood. Suppose that we observe $Y_1$, 
. . ., $Y_n$, iid, from $f(y - \theta)$ and we have ordered the observations so that 
$y = (y_1, \ldots, y_m)$ are uncensored and $(y_{m+1}, \ldots, y_n)$ are censored (and equal 
to $a$). The likelihood function is 

(5.9) $L(\theta|y) = [1 - F(a - \theta)]^{n-m} \prod_{i=1}^{m} f(y_i - \theta)$ , 

where $F$ is the cdf associated with $f$. If we had observed the last $n - m$ 
values, say $z = (z_{m+1}, \ldots, z_n)$, with $z_i > a$ $(i = m + 1, \ldots, n)$, we could have 
constructed the (complete data) likelihood 

$L^c(\theta|y, z) = \prod_{i=1}^{m} f(y_i - \theta) \prod_{i=m+1}^{n} f(z_i - \theta)$ , 

with which it often is easier to work. Note that 

$L(\theta|y) = E[L^c(\theta|y, Z)] = \int_{\mathcal{Z}} L^c(\theta|y, z) f(z|y, \theta) \, dz$, 

where $f(z|y, \theta)$ is the density of the missing data conditional on the observed 
data. For $f(y - \theta) = \mathcal{N}(\theta, 1)$ three likelihoods are shown in Figure 5.6. Note 
how the observed-data likelihood is biased down from the true value of $\theta$. || 

Fig. 5.6. The left panel shows three “likelihoods” of a sample of size 25 from a 
$\mathcal{N}(4, 1)$. The leftmost is the likelihood of the sample where values greater than 
4.5 are replaced by the value 4.5 (dotted), the center (solid) is the observed-data 
likelihood (5.9), and the rightmost (dashed) is the likelihood using the actual data. 
The right panel shows EM (dashed) and MCEM (solid) estimates; see Examples 
5.17 and 5.20. 

When (5.7) holds, the $Z$ vector merely serves to simplify calculations, 
and the way $Z$ is selected to satisfy (5.8) should not affect the value of 
the estimator. This is a missing data model, and we refer to the function 
$L^c(\theta|x, z) = f(x, z|\theta)$ as the “complete-model” or “complete-data” likelihood, 
which corresponds to the observation of the complete data $(x, z)$. This com- 
plete model is often within the exponential family framework, making it much 
easier to work with (see Problem 5.14). 

More generally, we refer to the representation (5.7) as demarginalization, 
a setting where a function (or a density) of interest can be expressed as an 
integral of a more manageable quantity. We will meet such setups again in 
Chapters 8-9. They cover models such as missing data models (censoring, 
grouping, mixing, etc.), latent variable models (tobit, probit, arch, stochastic 
volatility, etc.) and also artificial embedding, where the variable $Z$ in (5.8) 
has no meaning for the inferential or optimization problem, as illustrated by 
slice sampling (Chapter 8). 



5.3.2 The EM Algorithm 



The EM (Expectation-Maximization) algorithm was originally introduced by 
Dempster et al. (1977) to overcome the difficulties in maximizing likelihoods 
by taking advantage of the representation (5.7) and solving a sequence of easier 
maximization problems whose limit is the answer to the original problem. It 
thus fits naturally in this demarginalization section, even though it is not a 
stochastic algorithm in its original version. Monte Carlo versions are examined 
in Section 5.3.3 and in Note 5.5.1. Moreover, the EM algorithm relates to 
MCMC algorithms in the sense that it can be seen as a forerunner of the 
Gibbs sampler in its Data Augmentation version (Section 10.1.2), replacing 
simulation by maximization. 

Suppose that we observe $X_1, \ldots, X_n$, iid from $g(x|\theta)$ and want to compute 
$\hat\theta = \arg\max L(\theta|x) = \prod_{i=1}^{n} g(x_i|\theta)$. We augment the data with $z$, where 
$X, Z \sim f(x, z|\theta)$ and note the identity (which is a basic identity for the EM 
algorithm) 

(5.10) $k(z|\theta, x) = \frac{f(x, z|\theta)}{g(x|\theta)}$ , 

where $k(z|\theta, x)$ is the conditional distribution of the missing data $Z$ given 
the observed data $x$. The identity (5.10) leads to the following relationship 
between the complete-data likelihood $L^c(\theta|x, z)$ and the observed-data likeli- 
hood $L(\theta|x)$. For any value $\theta_0$, 

(5.11) $\log L(\theta|x) = E_{\theta_0}[\log L^c(\theta|x, Z)] - E_{\theta_0}[\log k(Z|\theta, x)]$, 

where the expectation is with respect to $k(z|\theta_0, x)$. We now see the EM algo- 
rithm as a demarginalization model. However, the strength of the EM algo- 
rithm is that it can go further. In particular, to maximize $\log L(\theta|x)$, we only 
have to deal with the first term on the right side of (5.11), as the other term 
can be ignored. 
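
For completeness, the step from (5.10) to (5.11) can be written out as follows. This is a short sketch using only the quantities defined above, with $g(x|\theta)$ read as the joint density of the observed sample.

```latex
\begin{align*}
\log g(x|\theta) &= \log f(x,z|\theta) - \log k(z|\theta,x)
  && \text{(logarithm of (5.10), valid for every } z\text{)} \\
\log L(\theta|x) &= \int \bigl[\log f(x,z|\theta) - \log k(z|\theta,x)\bigr]\,
  k(z|\theta_0,x)\,dz
  && \text{(the left side does not depend on } z\text{)} \\
 &= \mathbb{E}_{\theta_0}\bigl[\log L^c(\theta|x,Z)\bigr]
  - \mathbb{E}_{\theta_0}\bigl[\log k(Z|\theta,x)\bigr] .
\end{align*}
```

The second line integrates both sides against $k(z|\theta_0, x)$, which is a density in $z$ and therefore leaves the left side unchanged.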

Common EM notation is to denote the expected log-likelihood by 

(5.12) $Q(\theta|\theta_0, x) = E_{\theta_0}[\log L^c(\theta|x, Z)]$. 

We then maximize $Q(\theta|\theta_0, x)$, and if $\hat\theta_{(1)}$ is the value of $\theta$ maximizing 
$Q(\theta|\theta_0, x)$, the process can then be repeated with $\theta_0$ replaced by the up- 
dated value $\hat\theta_{(1)}$. In this manner, a sequence of estimators $\hat\theta_{(j)}$, $j = 1, 2, \ldots$, 
is obtained iteratively where $\hat\theta_{(j)}$ is defined as the value of $\theta$ maximizing 
$Q(\theta|\hat\theta_{(j-1)}, x)$; that is, 

(5.13) $Q(\hat\theta_{(j)}|\hat\theta_{(j-1)}, x) = \max_\theta Q(\theta|\hat\theta_{(j-1)}, x)$. 

The iteration described above contains both an expectation step and a 
maximization step, giving the algorithm its name. At the $j$th step of the 
iteration, we calculate the expectation (5.12), with $\theta_0$ replaced by $\hat\theta_{(j-1)}$ (the 
E-step), and then maximize it (the M-step). 

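Algorithm A.20 -The EM Algorithm- 

1. Compute 

$Q(\theta|\hat\theta_{(m)}, x) = E_{\hat\theta_{(m)}}[\log L^c(\theta|x, Z)]$ , 

where the expectation is with respect to $k(z|\hat\theta_{(m)}, x)$ (the E-step). 

2. Maximize $Q(\theta|\hat\theta_{(m)}, x)$ in $\theta$ and take (the M-step) [A.20] 

$\hat\theta_{(m+1)} = \arg\max_\theta Q(\theta|\hat\theta_{(m)}, x)$. 

The iterations are conducted until a fixed point of $Q$ is obtained. 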

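The theoretical core of the EM Algorithm is based on the fact that by 
maximizing $Q(\theta|\hat\theta_{(m)}, x)$ at each step, the likelihood on the left side of (5.11) 
is increased at each step. The following theorem was established by Dempster 
et al. (1977). 

Theorem 5.15. The sequence $(\hat\theta_{(j)})$ defined by (5.13) satisfies 

$L(\hat\theta_{(j+1)}|x) \ge L(\hat\theta_{(j)}|x)$, 

with equality holding if and only if $Q(\hat\theta_{(j+1)}|\hat\theta_{(j)}, x) = Q(\hat\theta_{(j)}|\hat\theta_{(j)}, x)$. 

Proof. On successive iterations, it follows from the definition of $\hat\theta_{(j+1)}$ that 

$Q(\hat\theta_{(j+1)}|\hat\theta_{(j)}, x) \ge Q(\hat\theta_{(j)}|\hat\theta_{(j)}, x)$. 

Thus, if we can show that 

(5.14) $E_{\hat\theta_{(j)}}[\log k(Z|\hat\theta_{(j+1)}, x)] \le E_{\hat\theta_{(j)}}[\log k(Z|\hat\theta_{(j)}, x)]$ , 

it will follow from (5.11) that the value of the likelihood is increased at each 
iteration. 

Since the difference of the logarithms is the logarithm of the ratio, (5.14) 
can be written as 

(5.15) $E_{\hat\theta_{(j)}}\left[\log\left(\frac{k(Z|\hat\theta_{(j+1)}, x)}{k(Z|\hat\theta_{(j)}, x)}\right)\right] \le \log E_{\hat\theta_{(j)}}\left[\frac{k(Z|\hat\theta_{(j+1)}, x)}{k(Z|\hat\theta_{(j)}, x)}\right] = 0$, 

where the inequality follows from Jensen’s inequality (see Problem 5.15), and 
the last equality holds because the ratio, integrated against $k(z|\hat\theta_{(j)}, x)$, 
is the integral of the density $k(z|\hat\theta_{(j+1)}, x)$, which is equal to 1. The 
theorem is therefore established. □ 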



Although Theorem 5.15 guarantees that the likelihood will increase at each 
iteration, we still may not be able to conclude that the sequence $(\hat\theta_{(j)})$ con- 
verges to a maximum likelihood estimator. To ensure convergence we require 
further conditions on the mapping $\hat\theta_{(j)} \to \hat\theta_{(j+1)}$. These conditions are inves- 
tigated by Boyles (1983) and Wu (1983). The following theorem is, perhaps, 
the most easily applicable condition to guarantee convergence to a stationary 
point, a zero of the first derivative that may be a local maximum or saddle- 
point. 

Theorem 5.16. If the expected complete-data likelihood $Q(\theta|\theta_0, x)$ is contin- 
uous in both $\theta$ and $\theta_0$, then every limit point of an EM sequence $(\hat\theta_{(j)})$ is a 
stationary point of $L(\theta|x)$, and $L(\hat\theta_{(j)}|x)$ converges monotonically to $L(\hat\theta|x)$ 
for some stationary point $\hat\theta$. 

Note that convergence is only guaranteed to a stationary point. Techniques 
such as running the EM algorithm a number of times with different, random 
starting points, or algorithms such as simulated annealing (see, for example, 
Finch et al. 1989) attempt to give some assurance that the global maximum 
is found. Wu (1983) states another theorem that guarantees convergence to a 
local maximum, but its assumptions are difficult to check. It is usually better, 
in practice, to use empirical methods (graphical or multiple starting values) 
to check that a maximum has been reached. 

As a first example, we look at the censored data likelihood of Example 
5.14. 



Example 5.17. EM for censored data. For $Y_i \sim \mathcal{N}(\theta, 1)$, with censoring 
at $a$, the complete-data likelihood is 

$L^c(\theta|y, z) \propto \prod_{i=1}^{m} \exp\{-(y_i - \theta)^2/2\} \prod_{i=m+1}^{n} \exp\{-(z_i - \theta)^2/2\}$. 

The density of the missing data $z = (z_{m+1}, \ldots, z_n)$ is a truncated normal 

(5.16) $Z \sim k(z|\theta, y) = \frac{1}{(2\pi)^{(n-m)/2} [1 - \Phi(a - \theta)]^{n-m}} \exp\left\{ -\sum_{i=m+1}^{n} (z_i - \theta)^2/2 \right\}$ , $z_i > a$, 

resulting in the expected complete-data log likelihood 

$-\frac{1}{2} \sum_{i=1}^{m} (y_i - \theta)^2 - \frac{1}{2} \sum_{i=m+1}^{n} E_{\theta'}[(Z_i - \theta)^2]$ . 

Before evaluating the expectation we differentiate and set equal to zero, solv- 
ing for the EM estimate 

$\hat\theta = \frac{m \bar{y} + (n - m) E_{\theta'}(Z_1)}{n}$ . 

This leads to the EM sequence 

(5.17) $\hat\theta^{(j+1)} = \frac{m}{n} \bar{y} + \frac{n - m}{n} \left[ \hat\theta^{(j)} + \frac{\phi(a - \hat\theta^{(j)})}{1 - \Phi(a - \hat\theta^{(j)})} \right]$ , 

where $\phi$ and $\Phi$ are the normal pdf and cdf, respectively. (See Problem 5.16 
for details of the calculations.) 

The EM sequence is shown in the right panel of Figure 5.6. The conver- 
gence in this problem is quite rapid, giving an MLE of 3.99, in contrast to the 
observed data mean of 3.55. || 
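
The EM recursion for the censored normal model is short enough to be coded directly. The Python sketch below is a minimal illustration: the function names are ours, and the small data set in the usage part is hypothetical (it is not the sample of size 25 behind Figure 5.6). It also evaluates the observed-data log-likelihood (5.9) along the sequence, so that the monotonicity stated in Theorem 5.15 can be checked numerically.

```python
import math


def phi(t):
    """Standard normal pdf."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def Phi(t):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def em_censored_normal(y_obs, n, a, theta0, n_iter=50):
    """EM sequence for N(theta, 1) data with n - len(y_obs) values censored at a."""
    m = len(y_obs)
    ybar = sum(y_obs) / m
    theta = theta0
    path = [theta]
    for _ in range(n_iter):
        t = a - theta
        # E-step: mean of a N(theta, 1) variable truncated to (a, infinity).
        ez = theta + phi(t) / (1.0 - Phi(t))
        # M-step: weighted average of observed mean and imputed mean.
        theta = (m * ybar + (n - m) * ez) / n
        path.append(theta)
    return path


def observed_loglik(theta, y_obs, n, a):
    """Observed-data log-likelihood (5.9) for the normal case, up to a constant."""
    m = len(y_obs)
    return (n - m) * math.log(1.0 - Phi(a - theta)) - 0.5 * sum(
        (y - theta) ** 2 for y in y_obs
    )


if __name__ == "__main__":
    y = [2.1, 3.0, 3.4, 3.9, 4.2, 4.4]  # hypothetical uncensored values
    path = em_censored_normal(y, n=10, a=4.5, theta0=sum(y) / len(y))
    print("EM estimate:", round(path[-1], 4))
```

Starting the sequence at the mean of the uncensored observations is only a convenient choice; any starting value leads to the same limit in this unimodal problem.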



The following example has a somewhat more complex model. 



Example 5.18. Cellular phone plans. A clear case of missing data occurs 
in the following estimation problem. It is typical for cellular phone companies 
to offer “plans” of options, bundling together four or five options (such as 
messaging, caller id, etc.) for one price, or, alternatively, selling them sepa- 
rately. One cellular company had offered a four-option plan in some areas, 
and a five-option plan (which included the four, plus one more) in another 
area. 



In each area, customers were asked to choose their favorite plan, and the 
results were tabulated. In some areas they chose their favorite from four 
plans, and in some areas from five plans. The phone company is interested in 
knowing which are the popular plans, to help them set future prices. A portion 
of the data are given in Table 5.6. We can model the complete data as follows. 
In area $i$, there are $n_i$ customers, each of whom chooses their favorite plan from 
Plans 1-5. The observation for customer $i$ is $Z_i = (Z_{i1}, \ldots, Z_{i5})$, where $Z_i$ 
is $\mathcal{M}(1, (p_1, p_2, \ldots, p_5))$. If we assume the customers are independent, in area 
$i$ the data are $T_i = (T_{i1}, \ldots, T_{i5}) = \sum_{j=1}^{n_i} Z_j \sim \mathcal{M}(n_i, (p_1, p_2, \ldots, p_5))$ (see 
Problem 5.23). If the first $m$ observations have the $Z_{i5}$ missing, denote the 
missing data by $x_i$ and then we have the complete-data likelihood 

(5.18) $L(p|T, x) = \prod_{i=1}^{m} \binom{n_i + x_i}{T_{i1}, \ldots, T_{i4}, x_i} p_1^{T_{i1}} \cdots p_4^{T_{i4}} p_5^{x_i} \times \prod_{i=m+1}^{n} \binom{n_i}{T_{i1}, \ldots, T_{i5}} \prod_{j=1}^{5} p_j^{T_{ij}}$ , 

where $p = (p_1, p_2, \ldots, p_5)$, $T = (T_1, T_2, \ldots, T_5)$, $x = (x_1, x_2, \ldots, x_m)$, and 
$\binom{n}{n_1, n_2, \ldots, n_k}$ is the multinomial coefficient $\frac{n!}{n_1! n_2! \cdots n_k!}$. The observed-data like- 
lihood can be calculated as $L(p|T) = \sum_x L(p|T, x)$, leading to the missing 
data distribution 

(5.19) $k(x|T, p) = \prod_{i=1}^{m} \binom{n_i + x_i}{x_i} p_5^{x_i} (1 - p_5)^{n_i + 1}$ , 

a product of negative binomial distributions. 

The rest of the EM analysis follows. Define $W_j = \sum_{i=1}^{n} T_{ij}$ for $j = 1, \ldots, 4$, 
and $W_5 = \sum_{i=m+1}^{n} T_{i5}$ for $j = 5$. The expected complete-data log likelihood is 

$\sum_{j=1}^{4} W_j \log p_j + \left[ W_5 + \sum_{i=1}^{m} E(X_i|p') \right] \log(1 - p_1 - p_2 - p_3 - p_4)$ , 

leading to the EM iterations 

$E(X_i|p^{(t)}) = (n_i + 1) \frac{\hat p_5^{(t)}}{1 - \hat p_5^{(t)}}$ , $\hat p_j^{(t+1)} = \frac{W_j}{\sum_{i=1}^{m} E(X_i|p^{(t)}) + \sum_{j'=1}^{5} W_{j'}}$ 

for $j = 1, \ldots, 4$. The MLE of $p$ is (0.273, 0.329, 0.148, 0.125, 0.125), with con- 
vergence being very rapid. Convergence of the estimators is shown in Figure 
5.7, and further details are given in Problem 5.23. (See also Example 9.22 for 
a Gibbs sampling treatment of this problem.) || 

Table 5.6. Cellular phone plan preferences in 37 areas: Data are number of cus- 
tomers who choose the particular plan as their favorite. Some areas ranked 4 plans 
(with the 5th plan denoted by -) and some ranked 5 plans. 
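
The EM iterations of this example can be sketched as follows in Python. The counts used here are hypothetical (they are not the data of Table 5.6), the function and variable names are ours, and the update of $p_5$ is obtained from the constraint that the probabilities sum to one.

```python
# Hypothetical counts: areas where only four plans were offered ...
FOUR_PLAN_AREAS = [[30, 40, 15, 15], [25, 35, 20, 20]]
# ... and areas where all five plans were offered.
FIVE_PLAN_AREAS = [[28, 33, 14, 12, 13], [27, 30, 16, 13, 14]]


def em_phone_plans(four_plan_areas, five_plan_areas, n_iter=200):
    """EM estimate of (p_1, ..., p_5) when the fifth count is missing in some areas."""
    # W_j for j = 1..4 sums over all areas; W_5 sums over five-plan areas only.
    W = [0.0] * 5
    for row in four_plan_areas:
        for j in range(4):
            W[j] += row[j]
    for row in five_plan_areas:
        for j in range(5):
            W[j] += row[j]
    n_plus_one = [sum(row) + 1 for row in four_plan_areas]
    p = [0.2] * 5  # arbitrary starting value
    for _ in range(n_iter):
        # E-step: negative binomial mean (n_i + 1) p_5 / (1 - p_5) for each area.
        odds = p[4] / (1.0 - p[4])
        expected_missing = sum(c * odds for c in n_plus_one)
        # M-step: multinomial proportions with the imputed fifth counts.
        total = expected_missing + sum(W)
        p = [W[j] / total for j in range(4)] + [(W[4] + expected_missing) / total]
    return p


if __name__ == "__main__":
    print([round(v, 4) for v in em_phone_plans(FOUR_PLAN_AREAS, FIVE_PLAN_AREAS)])
```

In this sketch the odds $p_5/(1-p_5)$ follow a linear recursion, so the limit can be verified by hand for any small data set.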
Fig. 5.7. EM sequence for cellular phone data, 25 iterations 



One weakness of EM is that it is “greedy”; it always moves upward to a 
higher value of the likelihood. This means that it cannot escape from a local 
mode. The following example illustrates the possible convergence of EM to 
the “wrong” mode, even in a well-behaved case. 

Example 5.19. EM for mean mixtures of normal distributions. Con- 
sider the mixture of two normal distributions already introduced in Example 
1.10, 

$p \, \mathcal{N}(\mu_1, \sigma^2) + (1 - p) \, \mathcal{N}(\mu_2, \sigma^2)$ , 

in the special case where all parameters but $(\mu_1, \mu_2)$ are known. Figure 5.8 
(bottom) shows the log-likelihood surface associated with this model and 500 
observations simulated with $p = 0.7$, $\sigma = 1$ and $(\mu_1, \mu_2) = (0, 3.1)$. As easily 
seen from this surface, the likelihood is bimodal,^ with one mode located near 
the true value of the parameters, and another located at $(2, -.5)$. Running 
EM five times with various starting points chosen at random, we represent 
the corresponding occurrences: three out of five sequences are attracted by the 
higher mode, while the other two go to the lower mode (even though the like- 
lihood is considerably smaller). This is because the starting points happened 
to be in the domain of attraction of the lower mode. We also represented 
in Figure 5.8 (top) the corresponding (increasing) sequence of log-likelihood 
values taken by the sequence. Note that in a very few iterations the value 
is close to the modal value, with improvement brought by further iterations 
being incremental. || 

^ Note that this is not a special occurrence associated with a particular sample: 
there are two modes of the likelihood for every simulation of a sample from this 
mixture, even though the model is completely identifiable. 
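
The dependence of EM on its starting point can be explored with a short simulation. The Python sketch below is an illustration of ours: it assumes the setting of this example ($p = 0.7$ and $\sigma = 1$ known, only the two means unknown), simulates its own sample with an arbitrary seed, and uses the standard EM updates for mixture means (posterior component weights in the E-step, weighted averages in the M-step). It does not reproduce the sample behind Figure 5.8.

```python
import math
import random


def norm_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def simulate(n, p=0.7, mu1=0.0, mu2=3.1, seed=0):
    """Sample of size n from p N(mu1, 1) + (1 - p) N(mu2, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(mu1 if rng.random() < p else mu2, 1.0) for _ in range(n)]


def loglik(x, mu1, mu2, p=0.7):
    return sum(math.log(p * norm_pdf(v, mu1) + (1 - p) * norm_pdf(v, mu2)) for v in x)


def em_means(x, mu1, mu2, p=0.7, n_iter=100):
    """EM for (mu1, mu2); returns the final means and the log-likelihood trace."""
    trace = [loglik(x, mu1, mu2, p)]
    for _ in range(n_iter):
        # E-step: probability that each observation comes from component 1.
        w = [
            p * norm_pdf(v, mu1) / (p * norm_pdf(v, mu1) + (1 - p) * norm_pdf(v, mu2))
            for v in x
        ]
        sw = sum(w)
        # M-step: weighted means.
        mu1 = sum(wi * v for wi, v in zip(w, x)) / sw
        mu2 = sum((1 - wi) * v for wi, v in zip(w, x)) / (len(x) - sw)
        trace.append(loglik(x, mu1, mu2, p))
    return mu1, mu2, trace


if __name__ == "__main__":
    data = simulate(500)
    starts = random.Random(1)
    for _ in range(5):
        s1, s2 = starts.uniform(-2, 5), starts.uniform(-2, 5)
        m1, m2, tr = em_means(data, s1, s2)
        print(f"start=({s1:.2f}, {s2:.2f})  end=({m1:.2f}, {m2:.2f})  loglik={tr[-1]:.2f}")
```

Comparing the final log-likelihood values across starting points is the simplest way to detect runs that have stopped at a lower mode.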





Fig. 5.8. Trajectories of five runs of the EM algorithm for Example 5.19 with their 
likelihood values (top) and their position on the likelihood surface (bottom). 



This example reinforces the need to rerun the algorithm 
a number of times, each time starting from a different initial value. 

The books by Little and Rubin (1987) and Tanner (1996) provide good 
overviews of the EM literature. Other references include Louis (1982), Little 
and Rubin (1983), Laird et al. (1987), Meng and Rubin (1991), Qian and 
Titterington (1992), Liu and Rubin (1994), McLachlan and Krishnan (1997), 
and Meng and van Dyk (1997). 
5.3.3 Monte Carlo EM 

A difficulty with the implementation of the EM algorithm is that each “E- 
step” requires the computation of the expected log likelihood $Q(\theta|\theta_0, x)$. Wei 
and Tanner (1990a,b) propose to use a Monte Carlo approach (MCEM) to 
overcome this difficulty, by simulating $Z_1, \ldots, Z_m$ from the conditional dis- 
tribution $k(z|x, \theta)$ and then maximizing the approximate complete-data log- 
likelihood 

(5.20) $\hat Q(\theta|\theta_0, x) = \frac{1}{m} \sum_{i=1}^{m} \log L^c(\theta|x, z_i)$ . 

When $m$ goes to infinity, this quantity indeed converges to $Q(\theta|\theta_0, x)$, and 
the limiting form of the Monte Carlo EM algorithm is thus the regular EM 
algorithm. The authors suggest that $m$ should be increased along with the 
iterations. Although the maximization of a sum like (5.20) is, in general, rather 
involved, exponential family settings often allow for closed-form solutions. 

Example 5.20. MCEM for censored data. The EM solution of Example 
5.17 can easily become an MCEM solution. For the EM sequence 

$\hat\theta^{(j+1)} = \frac{m}{n} \bar{y} + \frac{n - m}{n} E_{\hat\theta^{(j)}}(Z_1)$ , 

the MCEM solution replaces $E_{\hat\theta^{(j)}}(Z_1)$ with 

$\frac{1}{M} \sum_{i=1}^{M} Z_i$ , $Z_i \sim k(z|\hat\theta^{(j)}, y)$. 

The MCEM sequence is shown in the right panel of Figure 5.6. The conver- 
gence is not quite so rapid as EM. The variability is controlled by the choice 
of $M$, and a larger value would bring the sequences closer together. || 
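
A Monte Carlo version of the censored-data recursion only requires a way to simulate from the truncated normal distribution. The Python sketch below is one possible illustration: it draws the truncated values by inverting the normal cdf (a choice of ours, not necessarily the one used for Figure 5.6), and the data values, the seed, and the choice of $M$ are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_truncated(theta, a, rng):
    """One draw from N(theta, 1) restricted to (a, infinity), by cdf inversion."""
    lo = STD_NORMAL.cdf(a - theta)
    u = lo + (1.0 - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return theta + STD_NORMAL.inv_cdf(u)


def mcem_censored_normal(y_obs, n, a, theta0, M=2000, n_iter=30, seed=1):
    """MCEM sequence: the E-step mean is replaced by an average of M simulations."""
    rng = random.Random(seed)
    m = len(y_obs)
    ybar = sum(y_obs) / m
    theta = theta0
    path = [theta]
    for _ in range(n_iter):
        zbar = sum(draw_truncated(theta, a, rng) for _ in range(M)) / M
        theta = (m * ybar + (n - m) * zbar) / n
        path.append(theta)
    return path


if __name__ == "__main__":
    y = [2.1, 3.0, 3.4, 3.9, 4.2, 4.4]  # hypothetical uncensored values
    for M in (10, 100, 2000):
        path = mcem_censored_normal(y, n=10, a=4.5, theta0=3.5, M=M)
        print(M, [round(t, 3) for t in path[-3:]])
```

Running the sketch with several values of $M$ shows the effect mentioned above: the last iterates fluctuate visibly for small $M$ and settle close to the EM limit for large $M$.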



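Example 5.21. Genetic linkage. A classic (perhaps overused) example of 
the EM algorithm is the genetics problem (see Rao 1973, Dempster et al. 
1977, or Tanner 1996), where observations $(x_1, x_2, x_3, x_4)$ are gathered from 
the multinomial distribution 

$\mathcal{M}\left(n; \frac{1}{2} + \frac{\theta}{4}, \frac{1}{4}(1 - \theta), \frac{1}{4}(1 - \theta), \frac{\theta}{4}\right)$ . 

Estimation is easier if the $x_1$ cell is split into two cells, so we create the 
augmented model 

$(z_1, z_2, x_2, x_3, x_4) \sim \mathcal{M}\left(n; \frac{1}{2}, \frac{\theta}{4}, \frac{1}{4}(1 - \theta), \frac{1}{4}(1 - \theta), \frac{\theta}{4}\right)$ 

with $x_1 = z_1 + z_2$. The complete-data likelihood function is then sim- 
ply $\theta^{z_2 + x_4} (1 - \theta)^{x_2 + x_3}$, as opposed to the observed-data likelihood function 
$(2 + \theta)^{x_1} \theta^{x_4} (1 - \theta)^{x_2 + x_3}$. The expected complete log-likelihood function is 

$E_{\theta_0}[(Z_2 + x_4) \log\theta + (x_2 + x_3) \log(1 - \theta)]$ 

$= \left( \frac{\theta_0 x_1}{2 + \theta_0} + x_4 \right) \log\theta + (x_2 + x_3) \log(1 - \theta)$ , 

which can easily be maximized in $\theta$, leading to 

$\hat\theta_1 = \frac{\frac{\theta_0 x_1}{2 + \theta_0} + x_4}{\frac{\theta_0 x_1}{2 + \theta_0} + x_2 + x_3 + x_4}$ . 

If we instead use the Monte Carlo EM algorithm, $\theta_0 x_1/(2 + \theta_0)$ is replaced 
with the average 

$\bar{z}_m = \frac{1}{m} \sum_{i=1}^{m} z_i$ , 

where the $z_i$’s are simulated from a binomial distribution $\mathcal{B}(x_1, \theta_0/(2 + \theta_0))$. 
The maximum in $\theta$ is then 

$\hat\theta_1 = \frac{\bar{z}_m + x_4}{\bar{z}_m + x_2 + x_3 + x_4}$ . 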

This example is merely an illustration of the Monte Carlo EM algorithm 
since EM also applies. The next example, however, details a situation in which 
the expectation is quite complicated and the Monte Carlo EM algorithm works 
quite nicely. 
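
Both updates of Example 5.21 fit in a few lines of Python. The sketch below is ours; since the counts are not listed in this section, it assumes the classic data $(x_1, x_2, x_3, x_4) = (125, 18, 20, 34)$ used by Dempster et al. (1977), and the binomial draws are produced by summing Bernoulli variables to stay within the standard library.

```python
import random

# Assumed data: the classic linkage counts of Dempster et al. (1977).
X = (125, 18, 20, 34)


def em_step(theta, x=X):
    """One EM update: E[Z_2] replaces the unobserved count z_2."""
    x1, x2, x3, x4 = x
    ez2 = theta * x1 / (2.0 + theta)
    return (ez2 + x4) / (ez2 + x2 + x3 + x4)


def mcem_step(theta, m, rng, x=X):
    """One MCEM update: E[Z_2] is replaced by the average of m binomial draws."""
    x1, x2, x3, x4 = x
    q = theta / (2.0 + theta)
    draws = [sum(rng.random() < q for _ in range(x1)) for _ in range(m)]
    zbar = sum(draws) / m
    return (zbar + x4) / (zbar + x2 + x3 + x4)


def run_em(theta0=0.5, n_iter=25):
    path = [theta0]
    for _ in range(n_iter):
        path.append(em_step(path[-1]))
    return path


def run_mcem(theta0=0.5, n_iter=25, m=100, seed=0):
    rng = random.Random(seed)
    path = [theta0]
    for _ in range(n_iter):
        path.append(mcem_step(path[-1], m, rng))
    return path


if __name__ == "__main__":
    print("EM:  ", round(run_em()[-1], 4))
    print("MCEM:", round(run_mcem()[-1], 4))
```

The MCEM iterates do not settle at a single value but keep fluctuating around the EM limit, with a spread that shrinks as $m$ increases.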

Example 5.22. Capture-recapture models revisited. A generalization 
of a capture-recapture model (see Example 2.25) is to assume that an animal 
$i$, $i = 1, 2, \ldots, n$, can be captured at time $j$, $j = 1, 2, \ldots, t$, in one of $m$ 
locations, where the location is a multinomial random variable 

$H \sim \mathcal{M}_m(\theta_1, \ldots, \theta_m)$ . 

Of course, the animal may not be captured (it may not be seen, or it may 
have died). As we track each animal through time, we can model this process 
with two random variables. The random variable $H$ can take values in the 
set $\{1, 2, \ldots, m\}$ with probabilities $\{\theta_1, \ldots, \theta_m\}$. Given $H = k$, the random 
variable $X \sim \mathcal{B}(p_k)$, where $p_k$ is the probability of capturing the animal in 
location $k$. (See Dupuis 1995 for details.) 

As an example, for $t = 6$, a typical realization for an animal might be 

$h = (4, 1, -, 8, 3, -)$, $x = (1, 1, 0, 1, 1, 0)$ , 

where $h$ denotes the sequence of observed $h_j$’s and of non-captures. Thus, 
we have a missing data problem. If we had observed all of $h$, the maximum 
likelihood estimation would be trivial, as the MLEs would simply be the cell 
means. For animal $i$, we define the random variables $X_{ijk} = 1$ if animal $i$ is 
captured at time $j$ in location $k$, and 0 otherwise, 

$Y_{ijk} = I(H_{ij} = k) I(X_{ijk} = 1)$ 

(which is the observed data), and 

$Z_{ijk} = I(H_{ij} = k) I(X_{ijk} = 0)$ 

(which is the missing data). The likelihood function is 

$L(\theta_1, \ldots, \theta_m, p_1, \ldots, p_m | y, x) = \sum_z L(\theta_1, \ldots, \theta_m, p_1, \ldots, p_m | y, x, z) = \sum_z \prod_{k=1}^{m} \cdots$ , 

where the sum over $z$ represents the expectation over all the states that could 
have been visited. This can be a complicated expectation, but the likelihood 
can be calculated by first using an EM strategy and working with the complete 
data likelihood $L(\theta_1, \ldots, \theta_m, p_1, \ldots, p_m | y, x, z)$, then using MCEM for the cal- 
culation of the expectation. Note that calculation of the MLEs of $p_1, \ldots, p_m$ 
is straightforward, and for $\theta_1, \ldots, \theta_m$, we use 

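Algorithm A.21 -Capture-recapture MCEM Algorithm- 

1. (M-step) Take $\hat\theta_k = \frac{1}{nt} \sum_{i=1}^{n} \sum_{j=1}^{t} (y_{ijk} + \hat z_{ijk})$. 

2. (Monte Carlo E-step) If $x_{ijk} = 0$, for $\ell = 1, \ldots, L$, generate 

$\hat z_{ijk\ell} \sim \mathcal{M}_m(\hat\theta_1, \ldots, \hat\theta_m)$ , 

and calculate 

$\hat z_{ijk} = \sum_{\ell} \hat z_{ijk\ell} / L$ . 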


Scherrer (1997) examines the performance of more general versions of this 
algorithm and shows, in particular, that they outperform the conditional like- 
lihood approach of Brownie et al. (1993). || 

Note that the MCEM approach does not enjoy the EM monotonicity any 
longer and may even face some smoothness difficulties when the sample used in 
(5.20) is different for each new value $\hat\theta_j$. In more involved likelihoods, Markov 
chain Monte Carlo methods can also be used to generate the missing data sam- 
ple, usually creating additional dependence structures between the successive 
values produced by the algorithm. 

5.3.4 EM Standard Errors 

There are many algorithms and formulas available for obtaining standard 
errors from the EM algorithm (see Tanner 1996 for a selection). However, the 
formulation by Oakes (1999), and its Monte Carlo version, seem both simple 
and useful. 

Recall that the variance of the MLE, is approximated by 

r 1 

Var^« -^logL{e\x) 

Oakes (1999) shows that this second derivative can be expressed in terms of 
the complete-data likelihood 

(5.21) ^ log L{0 \k) 

where the expectation is taken with respect to the distribution of the missing 
data. Thus, for the EM algorithm, we have a formula for the variance of the 
MLE. 

The advantage of this expression is that it only involves the distribution 
of the complete data, which is often a reasonable distribution to work with. 
The disadvantage is that the mixed derivative may be difficult to compute.
However, in complexity of implementation it compares favorably with other 
methods. 

For the Monte Carlo EM algorithm, (5.21) cannot be used in its current 
form, as we would want all expectations to be on the outside. Then we could 
calculate the expression using a Monte Carlo sum. However, if we now take 
the derivative inside the expectation, we can rewrite Oakes’ identity as 

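(5.22) $$\frac{\partial^2}{\partial\theta^2} \log L(\theta|x) = E\left( \frac{\partial^2}{\partial\theta^2} \log L(\theta|x,z) \right) + E\left[ \left( \frac{\partial}{\partial\theta} \log L(\theta|x,z) \right)^2 \right] - \left[ E\left( \frac{\partial}{\partial\theta} \log L(\theta|x,z) \right) \right]^2 ,$$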

which is better suited for simulation, as all expectations are on the outside. 
Equation (5.22) can be expressed in the rather pleasing form 
Fig. 5.9. EM sequence $\pm$ one standard deviation for genetic linkage data, 25
iterations (horizontal axis: Iteration).



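$$\frac{\partial^2}{\partial\theta^2} \log L(\theta|x) = E\left( \frac{\partial^2}{\partial\theta^2} \log L(\theta|x,z) \right) + \text{var}\left( \frac{\partial}{\partial\theta} \log L(\theta|x,z) \right) ,$$

which allows the Monte Carlo evaluation

$$\frac{\partial^2}{\partial\theta^2} \log L(\theta|x) \approx \frac{1}{M} \sum_{j=1}^{M} \frac{\partial^2}{\partial\theta^2} \log L(\theta|x,z^{(j)}) + \frac{1}{M} \sum_{j=1}^{M} \left( \frac{\partial}{\partial\theta} \log L(\theta|x,z^{(j)}) - \frac{1}{M} \sum_{j'=1}^{M} \frac{\partial}{\partial\theta} \log L(\theta|x,z^{(j')}) \right)^2 ,$$

where $(z^{(j)})$, $j = 1, \ldots, M$, are generated from the missing data distribution
(and have already been generated to do MCEM).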
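
The Monte Carlo evaluation of the second derivative can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the text: it assumes the genetic linkage model with complete-data log-likelihood $(z_2 + x_4)\log\theta + (x_2 + x_3)\log(1-\theta)$ and missing data $Z_2 \sim$ Binomial$(x_1, \theta/(2+\theta))$, it uses the data (125, 18, 20, 34) of Problem 5.26, and all function names are hypothetical. The simulated value is compared with the second derivative of the observed log-likelihood, assumed here to be $x_1\log(2+\theta) + (x_2+x_3)\log(1-\theta) + x_4\log\theta$ up to a constant.

```python
import random


def em_linkage(x, theta=0.5, n_iter=100):
    """Plain EM iterations for the (assumed) genetic linkage model."""
    x1, x2, x3, x4 = x
    for _ in range(n_iter):
        ez2 = x1 * theta / (2.0 + theta)            # E-step: E[Z2 | x, theta]
        theta = (ez2 + x4) / (ez2 + x4 + x2 + x3)   # M-step: complete-data MLE
    return theta


def mc_second_derivative(x, theta, M=20000, seed=1):
    """Monte Carlo version of the identity: mean of the complete-data second
    derivatives plus the empirical variance of the complete-data scores."""
    rng = random.Random(seed)
    x1, x2, x3, x4 = x
    p = theta / (2.0 + theta)
    scores, hessians = [], []
    for _ in range(M):
        # Draw Z2 ~ Binomial(x1, p) as a sum of Bernoulli variables.
        z = sum(rng.random() < p for _ in range(x1))
        scores.append((z + x4) / theta - (x2 + x3) / (1.0 - theta))
        hessians.append(-(z + x4) / theta ** 2 - (x2 + x3) / (1.0 - theta) ** 2)
    mean_score = sum(scores) / M
    var_score = sum((s - mean_score) ** 2 for s in scores) / M
    return sum(hessians) / M + var_score


def exact_second_derivative(x, theta):
    """Second derivative of the observed log-likelihood (assumed form)."""
    x1, x2, x3, x4 = x
    return -x1 / (2.0 + theta) ** 2 - (x2 + x3) / (1.0 - theta) ** 2 - x4 / theta ** 2


if __name__ == "__main__":
    data = (125, 18, 20, 34)
    theta_hat = em_linkage(data)
    print(theta_hat)
    print(mc_second_derivative(data, theta_hat), exact_second_derivative(data, theta_hat))
```

Since the simulated $z^{(j)}$ are reused from the MCEM step in practice, the extra cost of this evaluation is essentially that of two averages.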



Example 5.23. Genetic linkage standard errors. For Example 5.21, the
complete-data likelihood is $\theta^{z_2+x_4}(1-\theta)^{x_2+x_3}$, and applying (5.22) yields

$$-\frac{\partial^2}{\partial\theta^2} \log L(\theta|x) = \frac{E Z_2 (1-\theta)^2 + x_4 (1-\theta)^2 + (x_2 + x_3)\theta^2}{\theta^2 (1-\theta)^2} ,$$

where $E Z_2 = x_1\theta/(2+\theta)$. In practice, we evaluate the expectation at the
converged value of the likelihood estimator. For these data we obtain $\hat\theta =$
.627 with standard deviation .048. Figure 5.9 shows the evolution of the EM
estimate and the standard error.
5.4 Problems



5.1 Use a numerical maximizer to find the maximum of 

$$f(x) = [\cos(50x) + \sin(20x)]^2 .$$

Compare to the results of a stochastic exploration. (Note: If you use R the
functions are optim and optimize.)

5.2 For each of the likelihood functions in Exercise 1.1, find the maximum with 
both R function optimize and stochastic exploration. 

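5.3 (Ó Ruanaidh and Fitzgerald 1996) Consider the setup of Example 5.1.

(a) Show that (5.2) holds.

(b) Discuss the validity of the approximation

$$x^t x - x^t G (G^t G)^{-1} G^t x \simeq \sum_{i=1}^{N} x_i^2 - \frac{2}{N} \left[ \left( \sum_{i=1}^{N} x_i \cos(\omega t_i) \right)^2 + \left( \sum_{i=1}^{N} x_i \sin(\omega t_i) \right)^2 \right] = \sum_{i=1}^{N} x_i^2 - 2 S_N .$$

(c) Show that $\pi(\omega|x)$ can be approximated by

$$\pi(\omega|x) \propto \left[ 1 - \frac{2 S_N}{\sum_{i=1}^{N} x_i^2} \right]^{(2-N)/2} .$$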



5.4 For the situation of Example 5.5: 

(a) Reproduce Figure 5.4. 

(b) For a range of values of r, and a range of values of c, where $T_t = c/\log(t)$,
examine the trajectories of $(x^{(t)}, h(x^{(t)}))$. Comment on the behaviors and
recommend an optimal choice of r and c.

5.5 Given a simple Ising model with energy function 



$$h(s) = \beta \sum_{(u,v) \in \mathcal{N}} s_u s_v ,$$

where $\mathcal{N}$ is the neighborhood relation that u and v are neighbors either
horizontally or vertically, apply the algorithm [A.19] to the cases $\beta = 0.4$ and
$\beta = 4.5$.

5.6 Here we will outline a proof of Theorem 5.10. It is based on the Laplace approxi- 
mation (see Section 3.4), and is rather painful. The treatment of the error terms 
here is a little cavalier; see Tierney and Kadane (1986) or Schervish (1995) for 
complete details. 

(a) Expand h{0) around 0* up to third order terms to establish 







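where $O(|\theta - \theta^*|^4) \le C|\theta - \theta^*|^4$ for some constant $C$ and $|\theta - \theta^*|$ small
enough.
(b) Use Taylor series expansions to show that

$$e^{\lambda O(|\theta - \theta^*|^4)} = 1 + \lambda O(|\theta - \theta^*|^4)$$

and

$$e^{\lambda h'''(\theta^*)(\theta - \theta^*)^3/6} = 1 + \lambda h'''(\theta^*)(\theta - \theta^*)^3/6 + \lambda O(|\theta - \theta^*|^6) .$$

(c) Substitute the part (b) expressions into the integral of part (a), and write
$\theta = t + \theta^*$, where $t = \theta - \theta^*$, to obtain

$$e^{\lambda h(\theta^*)} \int t\, e^{\lambda h''(\theta^*) t^2/2} [1 + \lambda h'''(\theta^*) t^3/6 + \lambda O(|t|^4)]\, dt + \theta^* e^{\lambda h(\theta^*)} \int e^{\lambda h''(\theta^*) t^2/2} [1 + \lambda h'''(\theta^*) t^3/6 + \lambda O(|t|^4)]\, dt .$$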

(d) All of the integrals in part (c) can be evaluated. Show that the ones involving 
odd powers of t in the integrand are zero by symmetry, and 

j \\t\^e 

which bounds the error terms. 

(e) Show that we now can write 

1 0e^^^^^d{e) = V^h"{9*) + ^ , 

I , 

and hence 

Je^H0)d{0) \3/2\^ 

completing the proof. 

(f) Show how Corollary 5.11 follows by starting with $\int b(\theta) e^{\lambda h(\theta)} d\theta$ instead
of $\int \theta e^{\lambda h(\theta)} d\theta$. Use a Taylor series on $b(\theta)$.

5.7 For the Student's t distribution $\mathcal{T}_\nu$:

(a) Show that the distribution $\mathcal{T}_\nu$ can be expressed as the (continuous) mixture
of a normal and of a chi squared distribution. (Hint: See Section 2.2.)

(b) When $X \sim \mathcal{T}_\nu$, given a function $h(x)$ derive a representation of $h(x) = E[H(x, Z)|x]$,
where $Z \sim \mathcal{G}a((\alpha - 1)/2, \alpha/2)$.

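5.8 For the normal mixture of Example 5.19, assume that $\sigma_1 = \sigma_2 = 1$ for
simplicity's sake.

(a) Show that the model does belong to the missing data category by considering
the vector $z = (z_1, \ldots, z_n)$ of allocations of the observations $x_i$ to the
first and second components of the mixture.

(b) Show that the mth iteration of the EM algorithm consists in replacing the
allocations $z_i$ by their expectation

$$E[Z_i|x_i, \mu_1^{(m-1)}, \mu_2^{(m-1)}] = \frac{(1-p)\,\varphi(x_i; \mu_2^{(m-1)}, 1)}{p\,\varphi(x_i; \mu_1^{(m-1)}, 1) + (1-p)\,\varphi(x_i; \mu_2^{(m-1)}, 1)}$$

and then setting the new values of the means as

$$\mu_1^{(m)} = \sum_i E[1 - Z_i|x_i, \mu_1^{(m-1)}, \mu_2^{(m-1)}]\, x_i \Big/ \sum_i E[1 - Z_i|x_i, \mu_1^{(m-1)}, \mu_2^{(m-1)}]$$

and

$$\mu_2^{(m)} = \sum_i E[Z_i|x_i, \mu_1^{(m-1)}, \mu_2^{(m-1)}]\, x_i \Big/ \sum_i E[Z_i|x_i, \mu_1^{(m-1)}, \mu_2^{(m-1)}] .$$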

5.9 Suppose that the random variable X has a mixture distribution (1.3); that is,
the $X_i$ are independently distributed as

$$X_i \sim \theta g(x) + (1 - \theta) h(x), \quad i = 1, \ldots, n,$$

where $g(\cdot)$ and $h(\cdot)$ are known. An EM algorithm can be used to find the ML
estimator of $\theta$. Introduce $Z_1, \ldots, Z_n$, where $Z_i$ indicates from which distribution
$X_i$ has been drawn, so

$$X_i|Z_i = 1 \sim g(x), \qquad X_i|Z_i = 0 \sim h(x).$$

(a) Show that the complete data likelihood can be written

$$L^c(\theta|x, z) = \prod_{i=1}^{n} [z_i g(x_i) + (1 - z_i) h(x_i)]\, \theta^{z_i} (1 - \theta)^{1 - z_i} .$$

(b) Show that $E[Z_i|\theta, x_i] = \theta g(x_i)/[\theta g(x_i) + (1 - \theta) h(x_i)]$ and, hence, that the
EM sequence is given by

$$\hat\theta_{(j+1)} = \frac{1}{n} \sum_{i=1}^{n} \frac{\hat\theta_{(j)} g(x_i)}{\hat\theta_{(j)} g(x_i) + (1 - \hat\theta_{(j)}) h(x_i)} .$$

(c) Show that $\hat\theta_{(j)}$ converges to $\hat\theta$, a maximum likelihood estimator of $\theta$.
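
The EM recursion of part (b) is short enough to sketch directly. The following Python fragment is only an illustration under assumptions that are not part of the problem: $g$ and $h$ are taken to be the $\mathcal{N}(0,1)$ and $\mathcal{N}(3,1)$ densities, the data are simulated with $\theta = 0.3$, and every function name is hypothetical. It also records the observed log-likelihood along the path, which should not decrease from one EM iteration to the next.

```python
import math
import random


def normal_pdf(x, mean):
    """Density of a unit-variance normal (illustrative choice for g and h)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def simulate(n, theta, rng):
    """Draw n points from theta * N(0,1) + (1 - theta) * N(3,1)."""
    return [rng.gauss(0.0, 1.0) if rng.random() < theta else rng.gauss(3.0, 1.0)
            for _ in range(n)]


def log_likelihood(theta, xs, g, h):
    """Observed-data log-likelihood of the mixture weight theta."""
    return sum(math.log(theta * g(x) + (1.0 - theta) * h(x)) for x in xs)


def em_mixture_weight(xs, g, h, theta=0.5, n_iter=200):
    """Return the whole EM path theta_(0), theta_(1), ..."""
    gx = [g(x) for x in xs]
    hx = [h(x) for x in xs]
    path = [theta]
    for _ in range(n_iter):
        # E-step: E[Z_i | theta, x_i]; M-step: average of these expectations.
        theta = sum(theta * a / (theta * a + (1.0 - theta) * b)
                    for a, b in zip(gx, hx)) / len(xs)
        path.append(theta)
    return path


if __name__ == "__main__":
    rng = random.Random(42)
    xs = simulate(2000, 0.3, rng)
    path = em_mixture_weight(xs, lambda x: normal_pdf(x, 0.0), lambda x: normal_pdf(x, 3.0))
    print(path[-1])
```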

5.10 Consider the sample x = (0.12, 0.17, 0.32, 0.56, 0.98, 1.03, 1.10, 1.18, 1.23, 1.67,
1.68, 2.33), generated from an exponential mixture

$$p\, \mathcal{E}xp(\lambda) + (1 - p)\, \mathcal{E}xp(\mu).$$

(a) Show that the likelihood $h(p, \lambda, \mu)$ can be expressed as $E[H(x, Z)]$, where
$z = (z_1, \ldots, z_{12})$ corresponds to the vector of allocations of the observations
$x_i$ to the first and second components of the mixture; that is, for $i = 1, \ldots, 12$,

$$P(z_i = 1) = 1 - P(z_i = 2) = \frac{p\lambda \exp(-\lambda x_i)}{p\lambda \exp(-\lambda x_i) + (1 - p)\mu \exp(-\mu x_i)} .$$

(b) Compare the performances of [A.22] with those of the EM algorithm in this
setup.

5.11 This problem refers to Section 5.5.4. 

(a) Show that f h{x\0)dx = 







(b) Show that for any two values 9 and ?7, 



log 



h{x\6) 

h{x\r]) 



h{x\9) ^ cje) _ h{x\e) 

h{x\r]) c{rf) h{x\r}) 



— logE 



h{X\ri)J’ 



where X ~ h{x\r]). {Hint: The last equality follows from part (a) and an 
importance sampling argument.) 

(c) Thus, establish the validity of the approximation 



maxh(x|^) 

X 



max 

X 




where the XiS are generated from h{x\r}) 

5.12 Consider $h(\alpha)$, the likelihood of the beta $\mathcal{B}(\alpha, \alpha)$ distribution associated with
the observation x = 0.345.

(a) Express the normalizing constant, $c(\alpha)$, in $h(\alpha)$ and show that it cannot be
easily computed when $\alpha$ is not an integer.

(b) Examine the approximation of the ratio $c(\alpha)/c(\alpha_0)$, for $\alpha_0 = 1/2$ by the
method of Geyer and Thompson (1992) (see Example 5.25).

(c) Compare this approach with the alternatives of Chen and Shao (1997), 
detailed in Problems 4.1 and 4.2. 

5.13 Consider the function

$$h(\theta) = \frac{\|\theta\|^2 (p + \|\theta\|^2)(2p - 2 + \|\theta\|^2)}{(1 + \|\theta\|^2)(p + 1 + \|\theta\|^2)(p + 3 + \|\theta\|^2)} ,$$

when $\theta \in \mathbb{R}^p$ and $p = 10$.

(a) Show that the function $h(\theta)$ has a unique maximum.

(b) Show that $h(\theta)$ can be expressed as $E[H(\theta, Z)]$, where $z = (z_1, z_2, z_3)$ and
$Z_i \sim \mathcal{E}xp(1/2)$ $(i = 1, 2, 3)$. Deduce that $f(z|x)$ does not depend on x in
(5.26).

(c) When $g(z) = \exp(-\alpha\{z_1 + z_2 + z_3\})$, show that the variance of (5.26) is
infinite for some values of $t = \|\theta\|^2$ when $\alpha > 1/2$. Identify $A_2$, the set of
values of t for which the variance of (5.26) is infinite when $\alpha = 2$.

(d) Study the behavior of the estimate (5.26) when t goes from $A_2$ to its
complement $A_2^c$ to see if the infinite variance can be detected in the evaluation
of $h(t)$.

5.14 In the exponential family, EM computations are somewhat simplified. Show
that if the complete data density f is of the form

$$f(y, z|\theta) = h(y, z) \exp\{\eta(\theta) T(y, z) - B(\theta)\},$$

then we can write

$$Q(\theta|\theta^*, y) = E_{\theta^*}[\log h(y, Z)] + \sum_i \eta_i(\theta) E_{\theta^*}[T_i|y] - B(\theta).$$

Deduce that calculating the complete-data MLE only involves the simpler
expectation $E_{\theta^*}[T_i|y]$.

5.15 For density functions f and g, we define the entropy distance between f and
g, with respect to f (also known as Kullback-Leibler information of g at f or
Kullback-Leibler distance between g and f) as

$$E_f[\log(f(X)/g(X))] = \int \log\left[ \frac{f(x)}{g(x)} \right] f(x)\, dx.$$

(a) Use Jensen's inequality to show that $E_f[\log(f(X)/g(X))] \ge 0$ and, hence,
that the entropy distance is always non-negative, and equals zero if $f = g$.

(b) The inequality in part (a) implies that $E_f \log[g(X)] \le E_f \log[f(X)]$. Show
that this yields (5.15).

(Note: Entropy distance was explored by Kullback 1968; for an exposition of its
properties, see, for example, Brown 1986. Entropy distance has, more recently,
found many uses in Bayesian analysis; see, for example, Berger 1985, Bernardo
and Smith 1994, or Robert 1996b.)

5.16 Refer to Example 5.17 

(a) Verify that the missing data density is given by (5.16). 

(b) Show that E^/Zi = EM sequence is given by (5.17). 

(c) Reproduce Figure 5.6. In particular, examine choices of M that will bring
the MCEM sequence closer to the EM sequence.

5.17 The probit model is a model with a covariate $X \in \mathbb{R}^p$ such that

$$Y|X = x \sim \mathcal{B}(\Phi(x^T\beta)),$$

where $\beta \in \mathbb{R}^p$ and $\Phi$ denotes the $\mathcal{N}(0, 1)$ cdf.

(a) Give the likelihood associated with a sample $((x_1, y_1), \ldots, (x_n, y_n))$.

(b) Show that, if we associate with each observation $(x_i, y_i)$ a missing variable
$Z_i$ such that

$$Z_i|X_i = x \sim \mathcal{N}(x^T\beta, 1), \qquad Y_i = \mathbb{I}_{Z_i > 0},$$

iteration m of the associated EM algorithm is the expected least squares
estimator

$$\beta_{(m)} = (X^T X)^{-1} X^T E_{\beta_{(m-1)}}[Z|x, y],$$

where $x = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n)$, and $Z = (Z_1, \ldots, Z_n)^T$, and X is
the matrix with columns made of the $x_i$'s.

(c) Give the value of $E_\beta[Z_i|x_i, y_i]$.

5.18 The following are genotype data on blood type. 



Genotype   Probability   Observed   Probability            Frequency
AA         $p_A^2$       A          $p_A^2 + 2p_Ap_O$      $n_A = 186$
AO         $2p_Ap_O$
BB         $p_B^2$       B          $p_B^2 + 2p_Bp_O$      $n_B =$
BO         $2p_Bp_O$
AB         $2p_Ap_B$     AB         $2p_Ap_B$              $n_{AB} = 13$
OO         $p_O^2$       O          $p_O^2$                $n_O = 284$



Because of dominance, we can only observe the genotype in the third column,
with probabilities given by the fourth column. The interest is in estimating the
allele frequencies $p_A$, $p_B$, and $p_O$ (which sum to 1).

(a) Under a multinomial model, verify that the observed data likelihood is
proportional to

$$(p_A^2 + 2p_Ap_O)^{n_A} (p_B^2 + 2p_Bp_O)^{n_B} (p_Ap_B)^{n_{AB}} (p_O^2)^{n_O} .$$

(b) With missing data $Z_A$ and $Z_B$, verify the complete data likelihood

$$(p_A^2)^{Z_A} (2p_Ap_O)^{n_A - Z_A} (p_B^2)^{Z_B} (2p_Bp_O)^{n_B - Z_B} (p_Ap_B)^{n_{AB}} (p_O^2)^{n_O} .$$

(c) Verify that the missing data distribution is

$$Z_A \sim \text{binomial}\left( n_A, \frac{p_A^2}{p_A^2 + 2p_Ap_O} \right) \quad \text{and} \quad Z_B \sim \text{binomial}\left( n_B, \frac{p_B^2}{p_B^2 + 2p_Bp_O} \right)$$

and write an EM algorithm to estimate $p_A$, $p_B$, and $p_O$.
5.19 Cox and Snell (1981) report data on survival time Y in weeks and $\log_{10}$(initial
white blood count), x, for 17 patients suffering from leukemia as follows.

x   3.36 2.88 3.63 3.41 3.78 4.02 4.0 4.23 3.73 3.85 3.97 4.51 4.54 5.0 5.0 4.72 5.0
Y   65   156  100  134  16   108  121 4    39   143  56   26   22   1   1   5    65

They suggest that an appropriate model for these data is

$$Y_i = b_0 \exp\{b_1(x_i - \bar x)\}\, \varepsilon_i, \quad \varepsilon_i \sim \mathcal{E}xp(1),$$

which leads to the likelihood function

$$L(b_0, b_1) = \prod_i [b_0 \exp\{b_1(x_i - \bar x)\}]^{-1} \exp\{-y_i/(b_0 \exp\{b_1(x_i - \bar x)\})\}.$$

(a) Differentiation of the log likelihood function leads to the following equations
for the MLEs:

$$\hat b_0 = \frac{1}{n} \sum_i y_i \exp(-\hat b_1(x_i - \bar x)), \qquad 0 = \sum_i y_i (x_i - \bar x) \exp(-\hat b_1(x_i - \bar x)).$$

Solve these equations for the MLEs ($\hat b_0 = 51.1$, $\hat b_1 = -1.1$).

(b) It is unusual for survival times to be uncensored. Suppose there is censoring
in that the experiment is stopped after 90 weeks (hence any $Y_i$ greater than
90 is replaced by a 90, resulting in 6 censored observations). Order the data
so that the first m observations are uncensored, and the last $n - m$ are
censored. Show that the observed-data likelihood is

$$L(b_0, b_1) = \prod_{i=1}^{m} [b_0 \exp\{b_1(x_i - \bar x)\}]^{-1} \exp\{-y_i/(b_0 \exp\{b_1(x_i - \bar x)\})\} \times \prod_{i=m+1}^{n} (1 - F(a|x_i)),$$

where

$$F(a|x_i) = \int_0^a [b_0 \exp\{b_1(x_i - \bar x)\}]^{-1} \exp\{-t/(b_0 \exp\{b_1(x_i - \bar x)\})\}\, dt = 1 - \exp\{-a/(b_0 \exp\{b_1(x_i - \bar x)\})\}.$$

(c) If we let $Z_i$ denote the missing data, show that the $Z_i$ are independent with
distribution

$$Z_i \sim \frac{[b_0 \exp\{b_1(x_i - \bar x)\}]^{-1} \exp\{-z/(b_0 \exp\{b_1(x_i - \bar x)\})\}}{1 - F(a|x_i)} , \quad a < z < \infty.$$

(d) Verify that the complete data likelihood is

$$L(b_0, b_1) = \prod_{i=1}^{m} [b_0 \exp\{b_1(x_i - \bar x)\}]^{-1} \exp\{-y_i/(b_0 \exp\{b_1(x_i - \bar x)\})\} \times \prod_{i=m+1}^{n} [b_0 \exp\{b_1(x_i - \bar x)\}]^{-1} \exp\{-z_i/(b_0 \exp\{b_1(x_i - \bar x)\})\},$$

and the expected complete-data log-likelihood is obtained by replacing the
$Z_i$ by their expected value

E[Zi] = {a -\- bo exp{bi(xx - x)) ^ ■ 

r \ Q,\Xi ) 

(e) Implement an EM algorithm to obtain the MLEs of $b_0$ and $b_1$.

5.20 Consider the following 12 observations from $\mathcal{N}_2(0, \Sigma)$, with $\sigma_1^2$, $\sigma_2^2$, and $\rho$
unknown:

$x_1$    1   1  -1  -1   2   2  -2  -2   *   *   *   *
$x_2$    1  -1   1  -1   *   *   *   *   2   2  -2  -2

where "*" represents a missing value.

(a) Show that the likelihood function has global maxima at $\rho = \pm 1/2$, $\sigma_1^2 =
\sigma_2^2 = 8/3$, and a saddlepoint at $\rho = 0$, $\sigma_1^2 = \sigma_2^2 = 5/2$.

(b) Show that if an EM sequence starts with $\rho = 0$, then it remains at $\rho = 0$
for all subsequent iterations.

(c) Show that if an EM sequence starts with $\rho$ bounded away from zero, it will
converge to a maximum.

(d) Take into account roundoff errors; that is, the fact that $[x_i]$ is observed
instead of $x_i$.

(Note: This problem is due to Murray 1977 and is discussed by Wu 1983.)

5.21 (Ó Ruanaidh and Fitzgerald 1996) Consider an AR(p) model

$$X_t = \sum_{j=1}^{p} \theta_j X_{t-j} + \epsilon_t ,$$

with $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$, observed for $t = p + 1, \ldots, m$. The future values $X_{m+1}, \ldots,
X_n$ are considered to be missing data. The initial values $x_1, \ldots, x_p$ are taken to
be zero.

(a) Give the expression of the observed and complete-data likelihoods.

(b) Give the conditional maximum likelihood estimators of $\theta$, $\sigma$ and $z =
(X_{m+1}, \ldots, X_n)^t$; that is, the maximum likelihood estimators when the two
other parameters are fixed.

(c) Detail the E- and M-steps of the EM algorithm in this setup, when applied
to the future values z and when $\sigma$ is fixed.

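5.22 We observe independent Bernoulli variables $X_1, \ldots, X_n$, which depend on
unobservable variables $Z_i$ distributed independently as $\mathcal{N}(\zeta, \sigma^2)$, where

$$X_i = \begin{cases} 0 & \text{if } Z_i \le u \\ 1 & \text{if } Z_i > u. \end{cases}$$

Assuming that u is known, we are interested in obtaining MLEs of $\zeta$ and $\sigma^2$.
(a) Show that the likelihood function is

$$p^S (1 - p)^{n - S} ,$$

where $S = \sum x_i$ and

$$p = P(Z_i > u) = \Phi\left( \frac{\zeta - u}{\sigma} \right).$$

(b) If we consider $z_1, \ldots, z_n$ to be the complete data, show that the complete
data likelihood is

$$\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(z_i - \zeta)^2}{2\sigma^2} \right\}$$

and the expected complete-data log-likelihood is

$$-\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( E[Z_i^2|x_i] - 2\zeta E[Z_i|x_i] + \zeta^2 \right).$$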

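(c) Show that the EM sequence is given by

$$\hat\zeta_{(j+1)} = \frac{1}{n} \sum_{i=1}^{n} t_i(\hat\zeta_{(j)}, \hat\sigma^2_{(j)}), \qquad \hat\sigma^2_{(j+1)} = \frac{1}{n} \left[ \sum_{i=1}^{n} v_i(\hat\zeta_{(j)}, \hat\sigma^2_{(j)}) - \frac{1}{n} \left( \sum_{i=1}^{n} t_i(\hat\zeta_{(j)}, \hat\sigma^2_{(j)}) \right)^2 \right],$$

where

$$t_i(\zeta, \sigma^2) = E[Z_i|x_i, \zeta, \sigma^2] \quad \text{and} \quad v_i(\zeta, \sigma^2) = E[Z_i^2|x_i, \zeta, \sigma^2].$$

(d) Show that

$$E[Z_i|x_i, \zeta, \sigma^2] = \zeta + \sigma H_i\left( \frac{u - \zeta}{\sigma} \right),$$

$$E[Z_i^2|x_i, \zeta, \sigma^2] = \zeta^2 + \sigma^2 + \sigma(u + \zeta) H_i\left( \frac{u - \zeta}{\sigma} \right),$$

where

$$H_i(t) = \begin{cases} \dfrac{\varphi(t)}{1 - \Phi(t)} & \text{if } X_i = 1 \\[2ex] -\dfrac{\varphi(t)}{\Phi(t)} & \text{if } X_i = 0. \end{cases}$$

(e) Show that $\hat\zeta_{(j)}$ converges to $\hat\zeta$ and that $\hat\sigma^2_{(j)}$ converges to $\hat\sigma^2$, the MLEs of $\zeta$
and $\sigma^2$, respectively.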

5.23 Referring to Example 5.18 

(a) Show that if $Z_i \sim \mathcal{M}(m_i; p_1, \ldots, p_k)$, $i = 1, \ldots, n$, then $\sum_i Z_i \sim \mathcal{M}(\sum_i m_i; p_1, \ldots, p_k)$.
(b) Show that the incomplete data likelihood is given by



y^L(p|z,x) 



HE 



7li+Xi I T,i T.4 a:. 



uHx 



I Til Ti 

rijl Pi--P4 



n 



n 



J l_i=m + l 



Til, . . . , Ti4 



j=l 



(c) Verify equation (5.19). 

(d) Verify the equations of the expected log-likelihood and the EM sequence. 

5.24 Recall the censored Weibull model of Problems 1.11-1.13. By extending the 
likelihood technique of Example 5.14, use the EM algorithm to fit the Weibull 
model, accounting for the censoring. Use the data of Problem 1.13 and fit the 
three cases outlined there. 

5.25 An alternate implementation of the Monte Carlo EM might be, for $Z_1, \ldots, Z_m \sim
k(z|x, \theta)$, to iteratively maximize

$$\log \hat L(\theta|x) = \frac{1}{m} \sum_{i=1}^{m} \{\log L^c(\theta|x, z_i) - \log k(z_i|\theta, x)\}$$

(which might more accurately be called Monte Carlo maximum likelihood).

(a) Show that $\hat L(\theta|x) \to L(\theta|x)$ as $m \to \infty$.

(b) Show how to use $\hat L(\theta|x)$ to obtain the MLE in Example 5.22. (Warning:
This is difficult.)

5.26 For the situation of Example 5.21, data $(x_1, x_2, x_3, x_4) = (125, 18, 20, 34)$ are
collected.

(a) Use the EM algorithm to find the MLE of $\theta$.

(b) Use the Monte Carlo EM algorithm to find the MLE of $\theta$. Compare your
results to those of part (a).

5.27 For the situation of Example 5.22: 

(a) Verify the formula for the likelihood function. 

(b) Show that the complete-data MLEs are given by $\hat\theta_k = \frac{1}{nt} \sum_{i=1}^{n} \sum_{j=1}^{t} (y_{ijk} + z_{ijk})$.

5.28 For the model of Example 5.22, Table 5.7 contains data on the movement
between 5 zones of 18 tree swallows with m = t = 5, where a 0 denotes that the
bird was not captured.

(a) Using the MCEM algorithm of Example 5.22, calculate the MLEs for
$\theta_1, \ldots, \theta_5$ and $p_1, \ldots, p_5$.

(b) Assume now that state 5 represents the death of the animal. Rewrite the 
MCEM algorithm to reflect this, and recalculate the MLEs. Compare them 
to the answer in part (a). 

5.29 Referring to Example 5.23 

(a) Verify the expression for the second derivative of the log-likelihood. 

(b) Reproduce the EM estimator and its standard error. 

(c) Estimate $\theta$, and its standard error, using the Monte Carlo EM algorithm.
Compare the results to those of the EM algorithm. 
        Time                 Time                 Time
     1  2  3  4  5        1  2  3  4  5        1  2  3  4  5
  a  2  2  0  0  0     g  1  1  1  5  0     m  1  1  1  1  1
  b  2  2  0  0  0     h  4  2  0     0     n  2  2  1  0  0
  c  4  1  1  2  0     i  5  5  5  5  0     o  4  2  2  0  0
  d  4  2  0  0  0     j  2  2  0  0  0     p  1  1  1  1  0
  e  1  1  0  0  0     k  2  5  0  0  0     q  1  0  0  4  0
  f  1  1  0  0  0     l  1  1  0  0  0     s  2  2  0  0  0

Table 5.7. Movement histories of 18 tree swallows over 5 time periods (Source:
Scherrer 1997.)



5.30 Referring to Section 5.3.4 

(a) Show how (5.22) can be derived from (5.21). 

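(b) The derivation in Section 5.3.4 is valid for vector parameters $\theta = (\theta_1, \ldots, \theta_p)$.
For a function $h(\theta)$, define

$$h^{(1)}(\theta) = \left\{ \frac{\partial}{\partial\theta_i} h(\theta) \right\}_{i} , \qquad h^{(2)}(\theta) = \left\{ \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} h(\theta) \right\}_{ij} ,$$

where $h^{(1)}$ is a p x 1 vector and $h^{(2)}$ is a p x p matrix. Using this notation,
show that (5.22) becomes

$$\log L(\theta|x)^{(2)} = E\left( \log L(\theta|x,z)^{(2)} \right) + E\left[ \left( \log L(\theta|x,z)^{(1)} \right) \left( \log L(\theta|x,z)^{(1)} \right)' \right] - \left[ E\left( \log L(\theta|x,z)^{(1)} \right) \right] \left[ E\left( \log L(\theta|x,z)^{(1)} \right) \right]' .$$

(c) Use the equation in part (b) to attach standard errors to the EM estimates
in Example 5.18.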

5.31 (For baseball fans only) It is typical for baseball announcers to report biased 
information, intended to overstate a player’s ability. If we consider a sequence 
of at-bats as Bernoulli trials, we are likely to hear the report of a maximum 
(the player is 8-out-of-his-last-17) rather than an ordinary average. Assuming 
that $X_1, X_2, \ldots, X_n$ are the Bernoulli random variables representing a player's
sequence of at-bats (1=hit, 0=no hit), a biased report is the observance of $k^*$,
$m^*$, and $r^*$, where

$$r^* = \frac{k^*}{m^*} = \max_{m^* \le i \le n} \frac{X_n + X_{n-1} + \cdots + X_{n-i}}{i + 1} .$$

If we assume that $E[X_i] = \theta$, then $\theta$ is the player's true batting ability and the
parameter of interest. Estimation of $\theta$ is difficult using only $k^*$, $m^*$, and $r^*$, but
it can be accomplished with an EM algorithm. With observed data $(k^*, m^*, r^*)$,
let $z = (z_1, \ldots, z_{n-m^*-1})$ be the augmented data. (This is a sequence of 0's and
1's that are commensurate with the observed data. Note that $X_{n-m^*}$ is certain
to equal 0.)
(a) Show that the EM sequence is given by

$$\hat\theta_{j+1} = \frac{k^* + E[S_z|\hat\theta_j]}{n} ,$$

where $E[S_z|\hat\theta_j]$ is the expected number of successes in the missing data,
assuming that $\hat\theta_j$ is the true value of $\theta$.

(b) Give an algorithm for computing the sequence $(\hat\theta_j)$. Use a Monte Carlo
approximation to evaluate the expectation.



At-Bat   $k^*$   $m^*$   $\hat\theta$   EM MLE
339      12      39      0.298          0.240
340      47      155     0.297          0.273
341      13      41      0.299          0.251
342      13      42      0.298          0.245
343      14      43      0.300          0.260
344      14      44      0.299          0.254
345      4       11      0.301          0.241
346      5       11      0.303          0.321

Table 5.8. A portion of the 1992 batting record of major-league baseball player
Dave Winfield.



(c) For the data given in Table 5.8, implement the Monte Carlo EM algorithm 
and calculate the EM estimates. 

(Note: The "true batting average" $\hat\theta$ cannot be computed from the given data
and is only included for comparison. The selected data EM MLEs are usually
biased downward, but also show a large amount of variability. See Casella and
Berger 1994 for details.)

5.32 The following dataset gives independent observations of $Z = (X, Y) \sim
\mathcal{N}_2(0, \Sigma)$ with missing data *.

x    1.17  -0.98   0.18   0.57   0.21   *      *      *
y    0.34  -1.24  -0.13   *      *     -0.12  -0.83   1.64

(a) Show that the observed likelihood is 

3 

i=l 

(b) Examine the consequence of the choice of $\pi(\Sigma) \propto |\Sigma|^{-1}$ on the posterior
distribution of $\Sigma$.

(c) Show that the missing data can be simulated from

$$X_i^* \sim \mathcal{N}\left( \rho \frac{\sigma_1}{\sigma_2} y_i, \sigma_1^2 (1 - \rho^2) \right) \quad (i = 6, 7, 8),$$

$$Y_i^* \sim \mathcal{N}\left( \rho \frac{\sigma_2}{\sigma_1} x_i, \sigma_2^2 (1 - \rho^2) \right) \quad (i = 4, 5),$$

to derive a stochastic EM algorithm.
(d) Derive an efficient simulation method to obtain the MLE of the covariance
matrix $\Sigma$.

5.33 The EM algorithm can also be implemented in a Bayesian hierarchical model 
to find a posterior mode. Suppose that we have the hierarchical model 

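$$x|\theta \sim f(x|\theta) ,$$

$$\theta|\lambda \sim \pi(\theta|\lambda) ,$$

$$\lambda \sim \gamma(\lambda) ,$$

where interest would be in estimating quantities from $\pi(\theta|x)$. Since

$$\pi(\theta|x) = \int \pi(\theta, \lambda|x)\, d\lambda,$$

where $\pi(\theta, \lambda|x) = \pi(\theta|\lambda, x)\pi(\lambda|x)$, the EM algorithm is a candidate method for
finding the mode of $\pi(\theta|x)$, where $\lambda$ would be used as the augmented data.

(a) Define $k(\lambda|\theta, x) = \pi(\theta, \lambda|x)/\pi(\theta|x)$ and show that

$$\log \pi(\theta|x) = \int \log \pi(\theta, \lambda|x)\, k(\lambda|\theta^*, x)\, d\lambda - \int \log k(\lambda|\theta, x)\, k(\lambda|\theta^*, x)\, d\lambda.$$

(b) If the sequence $(\hat\theta_{(j)})$ satisfies

$$\max_\theta \int \log \pi(\theta, \lambda|x)\, k(\lambda|\hat\theta_{(j)}, x)\, d\lambda = \int \log \pi(\hat\theta_{(j+1)}, \lambda|x)\, k(\lambda|\hat\theta_{(j)}, x)\, d\lambda,$$

show that $\log \pi(\hat\theta_{(j+1)}|x) \ge \log \pi(\hat\theta_{(j)}|x)$. Under what conditions will the
sequence converge to the mode of $\pi(\theta|x)$?

(c) For the hierarchy

$$x|\theta \sim \mathcal{N}(\theta, 1) ,$$

$$\theta|\lambda \sim \mathcal{N}(\lambda, 1) ,$$

with $\pi(\lambda) = 1$, show how to use the EM algorithm to calculate the posterior
mode of $\pi(\theta|x)$.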

5.34 Let $X \sim \mathcal{T}_\nu$ and consider the function

$$h(x) = \frac{\exp(-x^2/2)}{[1 + (x - \mu)^2]^\nu} .$$

(a) Show that $h(x)$ can be expressed as the conditional expectation $E[H(x, Z)|x]$,
when $Z \sim \mathcal{G}a(\nu, 1)$.

(b) Apply the direct Monte Carlo method of Section 5.5.4 to maximize (5.5.4)
and determine whether or not the resulting sequence converges to the true
maximum of h.

(c) Compare the implementation of (b) with an approach based on (5.26) for
(i) $g = \mathcal{E}xp(\lambda)$ and (ii) $g = f(z|\mu)$, the conditional distribution of Z given
$X = \mu$. For each choice, examine if the approximation (5.26) has a finite
variance.

(d) Run [A.22] to see if the recursive scheme of Geyer (1996) improves the
convergence speed.
5.5 Notes



5.5.1 Variations on EM 

Besides a possible difficult computation in the E-step (see Section 5.3.3), problems 
with the EM algorithm can occur in the case of multimodal likelihoods. The increase 
of the likelihood function at each step of the algorithm ensures its convergence to 
the maximum likelihood estimator in the case of unimodal likelihoods but implies a 
dependence on initial conditions for multimodal likelihoods. Several proposals can 
be found in the literature to overcome this problem, one of which we now describe. 

Broniatowski et al. (1984) and Celeux and Diebolt (1985, 1992) have tried to
overcome the dependence of EM methods on the starting value by replacing step 1
in [A.20] with a simulation step, the missing data $z_m$ being generated conditionally
on the observation x and on the current value of the parameter $\theta_m$. The maximization
in step 2 is then done on the (simulated) complete-data log-likelihood, $H(x, z_m|\theta)$.
The appeal of this approach is that it allows for a more systematic exploration of
the likelihood surface by partially avoiding the fatal attraction of the closest mode.
Unfortunately, the theoretical convergence results for these methods are limited: The
Markov chain $(\theta_m)$ produced by this variant of EM called SEM (for Stochastic EM)
is often ergodic, but the relation of the stationary distribution with the maxima of
the observed likelihood is rarely known (see Diebolt and Ip 1996). Moreover, the
authors mainly study the behavior of the "ergodic" average,

$$\frac{1}{M} \sum_{m=1}^{M} \theta_m ,$$

instead of the "global" mode,

$$\hat\theta^{(M)} = \arg \max_{1 \le m \le M} \ell(\theta_m|x),$$

which is more natural in this setup. Celeux and Diebolt (1990) have, however, solved
the convergence problem of SEM by devising a hybrid version called SAEM (for
Simulated Annealing EM), where the amount of randomness in the simulations decreases
with the iterations, ending up with an EM algorithm. This version actually relates
to the simulated annealing methods, described in Section 5.2.3. Celeux et al. (1996)
also propose a hybrid version, where SEM produces the starting point of EM,
when the latter applies. See Lavielle and Moulines (1997) for an approach similar to
Celeux and Diebolt (1990), where the authors obtain convergence conditions which
are equivalent to those of EM. In the same setting, Doucet et al. (2002) develop
an algorithm called the SAME algorithm (where SAME stands for state augmentation
for marginal estimation), which integrates the ideas of Sections 5.2.4 and 5.3.1
within an MCMC algorithm (see Chapters 7 and 10).
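
As a rough illustration of the SEM idea (simulate the missing data, then maximize the simulated complete-data likelihood), here is a minimal Python sketch. It rests on assumptions that are not in the text: the model is a two-component mixture $\theta\,\mathcal{N}(0,1) + (1-\theta)\,\mathcal{N}(3,1)$ with only the weight unknown, for which the complete-data MLE is the proportion of simulated allocations to the first component, and the function names, burn-in length, and chain length are all arbitrary choices. The function returns the "ergodic" average together with the whole chain.

```python
import math
import random


def normal_pdf(x, mean):
    """Density of a unit-variance normal (illustrative component density)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def simulate(n, theta, rng):
    """Draw n points from theta * N(0,1) + (1 - theta) * N(3,1)."""
    return [rng.gauss(0.0, 1.0) if rng.random() < theta else rng.gauss(3.0, 1.0)
            for _ in range(n)]


def sem_mixture_weight(xs, theta=0.5, n_iter=300, burn_in=100, seed=0):
    """Stochastic EM chain for the mixture weight; returns (ergodic average, chain)."""
    rng = random.Random(seed)
    gx = [normal_pdf(x, 0.0) for x in xs]
    hx = [normal_pdf(x, 3.0) for x in xs]
    chain = []
    for _ in range(n_iter):
        # Simulation step: draw each allocation z_i given x_i and the current theta.
        z_sum = 0
        for a, b in zip(gx, hx):
            p = theta * a / (theta * a + (1.0 - theta) * b)
            z_sum += rng.random() < p
        # Maximization step on the simulated complete data: theta = mean of the z_i.
        theta = z_sum / len(xs)
        chain.append(theta)
    kept = chain[burn_in:]
    return sum(kept) / len(kept), chain


if __name__ == "__main__":
    data = simulate(2000, 0.3, random.Random(7))
    average, chain = sem_mixture_weight(data)
    print(average, min(chain[100:]), max(chain[100:]))
```

Unlike an EM sequence, the chain keeps fluctuating around its stationary regime, which is why a summary such as the average (or the best value visited) is needed.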

Meng and Rubin (1991, 1992), Liu and Rubin (1994), and Meng and van Dyk
(1997) have also developed versions of EM called ECM which take advantage of 
Gibbs sampling advances by maximizing the complete-data likelihood along succes- 
sive given directions (that is, through conditional likelihoods). 
5.5.2 Neural Networks

Neural networks provide another type of missing data model where simulation meth- 
ods are almost always necessary. These models are frequently used in classification 
and pattern recognition, as well as in robotics and computer vision (see Cheng and 
Titterington 1994 for a review on these models). Barring the biological vocabulary 
and the idealistic connection with actual neurons, the theory of neural networks 
covers 

(i) modeling nonlinear relations between explanatory and dependent (explained) 
variables, 

(ii) estimation of the parameters of these models based on a (training) sample.

Although the neural network literature usually avoids probabilistic modeling, these 
models can be analyzed and estimated from a statistical point of view (see Neal 1999 
or Ripley 1994, 1996). They can also be seen as a particular type of nonparametric 
estimation problem, where a major issue is then identifiability. 

A simple classical example of a neural network is the multilayer model (also called
the backpropagation model) which relates explanatory variables $x = (x_1, \ldots, x_n)$
with dependent variables $y = (y_1, \ldots, y_n)$ through a hidden "layer", $h = (h_1, \ldots, h_p)$,
where $(k = 1, \ldots, p;\ \ell = 1, \ldots, n)$

$$h_k = f\left( \alpha_{k0} + \sum_j \alpha_{kj} x_j \right) ,$$

$$E[Y_\ell|h] = g\left( \beta_{\ell 0} + \sum_k \beta_{\ell k} h_k \right) ,$$

and $\text{var}(Y_\ell) = \sigma^2$. The functions f and g are known (or arbitrarily chosen) from
categories such as threshold, $f(t) = \mathbb{I}_{t>0}$, hyperbolic, $f(t) = \tanh(t)$, or sigmoid,
$f(t) = 1/(1 + e^{-t})$.

As an example, consider the problem of character recognition, where handwritten
manuscripts are automatically deciphered. The x's may correspond to geometric
characteristics of a digitized character, or to pixel gray levels, and the y's are the
26 letters of the alphabet, plus side symbols. (See Le Cun et al. 1989 for an actual
modeling based on a sample of 7291, 16 x 16 pixel images, for 9760 parameters.)
The likelihood of the multilayer model then includes the parameters $\alpha = (\alpha_{kj})$ and
$\beta = (\beta_{\ell k})$ in a nonlinear structure. Assuming normality, for observations $(y_t, x_t)$,
$t = 1, 2, \ldots, T$, the log-likelihood can be written

$$\ell(\alpha, \beta|x, y) = -\sum_{t=1}^{T} \sum_{l=1}^{n} (y_{tl} - E[y_{tl}|x_{tl}])^2 / 2\sigma^2 .$$

A similar objective function can be derived using a least squares criterion. The
maximization of $\ell(\alpha, \beta|x, y)$ involves the detection and the elimination of numerous
local modes.
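
To make the structure of the multilayer model concrete, the following Python sketch evaluates the hidden layer, the conditional mean, and the Gaussian log-likelihood (up to an additive constant). It is only an illustration: the nested-list layout of the weights (intercept first in each row), the default choices $f = \tanh$ and $g$ equal to the identity, and the function names are assumptions made for this example.

```python
import math


def forward(x, alpha, beta, f=math.tanh, g=lambda t: t):
    """Conditional mean of y given x in the multilayer model.

    alpha[k] = [alpha_k0, alpha_k1, ..., alpha_kn] (hidden unit k),
    beta[l]  = [beta_l0,  beta_l1,  ..., beta_lp ] (output l).
    """
    hidden = [f(a[0] + sum(w * xj for w, xj in zip(a[1:], x))) for a in alpha]
    return [g(b[0] + sum(w * hk for w, hk in zip(b[1:], hidden))) for b in beta]


def log_likelihood(data, alpha, beta, sigma2=1.0):
    """Gaussian log-likelihood, up to a constant, for pairs (x_t, y_t) in data."""
    total = 0.0
    for x, y in data:
        mean = forward(x, alpha, beta)
        total -= sum((yl - ml) ** 2 for yl, ml in zip(y, mean)) / (2.0 * sigma2)
    return total


if __name__ == "__main__":
    alpha = [[0.0, 1.0, -1.0], [0.5, 0.2, 0.3]]   # two hidden units, two inputs
    beta = [[0.1, 1.0, -2.0]]                     # one output
    sample = [([0.3, -0.7], [0.4]), ([1.0, 0.5], [-0.2])]
    print(log_likelihood(sample, alpha, beta))
```

Any generic optimizer applied to `log_likelihood` as a function of the weights would face the local modes mentioned above; the sketch only provides the objective.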



5.5.3 The Robbins-Monro procedure

The Robbins-Monro algorithm (Robbins and Monro 1951) is a technique of stochastic
approximation to solve for x in equations of the form
(5.23) $$h(x) = \beta ,$$

when $h(x)$ can be written as in (5.8). It was also extended by Kiefer and Wolfowitz
(1952) to the more general setup of (5.1). In the case of (5.23), the Robbins-Monro
method proceeds by generating a Markov chain of the form

(5.24) $$X_{j+1} = X_j + \gamma_j (\beta - H(Z_j, X_j)) ,$$

where $Z_j$ is simulated from the conditional distribution defining (5.8). The following
result then describes sufficient conditions on the $\gamma_j$'s for the algorithm to be
convergent (see Bouleau and Lepingle 1994 for a proof).

Theorem 5.24. If $(\gamma_n)$ is a sequence of positive numbers such that

$$\sum_{n=1}^{\infty} \gamma_n = +\infty \quad \text{and} \quad \sum_{n=1}^{\infty} \gamma_n^2 < +\infty,$$

if the $x_j$'s are simulated from H conditionally on $\theta_j$ such that

$$E[x_j|\theta_j] = h(\theta_j)$$

and $|x_j| \le B$ for a fixed bound B, and if there exists $\theta^* \in \Theta$ such that

$$\inf_{\delta \le |\theta - \theta^*| \le 1/\delta} (\theta - \theta^*) \cdot (h(\theta) - \beta) > 0$$

for every $0 < \delta < 1$, the sequence $(\theta_j)$ converges to $\theta^*$ almost surely.

The solution of the maximization problem (5.1) can be expressed in terms of
the solution of the equation $\nabla h(\theta) = 0$ if the problem is sufficiently regular; that
is, if the maximum is not achieved on the boundary of the domain $\Theta$. Note then
the similarity between (5.3) and (5.24). Since its proposal, this method has seen
numerous variations. Besides Benveniste et al. (1990) and Bouleau and Lepingle
(1994), see Wasan (1969), Kersting (1987), Winkler (1995), or Duflo (1996) for
more detailed references.
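
The recursion (5.24) is easy to try on a toy problem. The sketch below is an illustration under assumptions of our own choosing, not an example from the text: the unknown function is $h(\theta) = \tanh(\theta)$, observed only through the bounded noisy version $H = \tanh(\theta) + U$ with $U$ uniform on $(-1, 1)$, the target level is $\beta = 0.5$, and the gains are $\gamma_j = 1/j$, which satisfy both summability conditions of Theorem 5.24. The function names are hypothetical.

```python
import math
import random


def robbins_monro(noisy_h, beta, theta0, n_iter, seed=0):
    """Stochastic approximation of the root of h(theta) = beta.

    noisy_h(theta, rng) must return an unbiased, bounded observation of h(theta).
    """
    rng = random.Random(seed)
    theta = theta0
    for j in range(1, n_iter + 1):
        gamma_j = 1.0 / j    # sum of gamma_j diverges, sum of gamma_j**2 converges
        theta += gamma_j * (beta - noisy_h(theta, rng))
    return theta


def noisy_tanh(theta, rng):
    """tanh(theta) observed with bounded, zero-mean uniform noise."""
    return math.tanh(theta) + rng.uniform(-1.0, 1.0)


if __name__ == "__main__":
    estimate = robbins_monro(noisy_tanh, beta=0.5, theta0=0.0, n_iter=50000)
    print(estimate, math.atanh(0.5))   # the exact root is atanh(0.5)
```

Because $h$ is increasing here, the sign condition $(\theta - \theta^*)(h(\theta) - \beta) > 0$ of the theorem holds away from the root, which is what pulls the iterates back toward $\theta^*$.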

When the function h has several local maxima, the Robbins-Monro procedure
converges to one of these maxima. In the particular case $H(x, \theta) = h(\theta) + x/\sqrt{a}$
$(a > 0)$, Pflug (1994) examines the relation

(5.25) $$\theta_{j+1} = \theta_j + a\, h(\theta_j) + \sqrt{a}\, x_j ,$$

when the $X_j$ are iid with $E[X_j] = 0$ and $\text{var}(X_j) = T$. The relevance of this particular
case is that, under the conditions

(i) $\exists k > 0, k_1 > 0$ such that $\theta \cdot h(\theta) \le -k_1|\theta|$ for $|\theta| > k$,

(ii) $|h(\theta)| \le k_2 |h(\theta_0)|$ for $K < |\theta| < |\theta_0|$,

(iii) $|h'(\theta)| \le k_3$,

the stationary measure $\nu_a$ associated with the Markov chain (5.25) weakly converges
to the distribution with density

$$c(T) \exp(E(\theta)/T) ,$$
when a goes to 0, T remaining fixed and E being the primitive of h (see Duflo
1996). The hypotheses (i)-(iii) ensure, in particular, that $\exp\{E(\theta)/T\}$ is integrable
on $\mathbb{R}$. Therefore, the limiting distribution of $\theta_j$ (when a goes to 0) is the so-called
Gibbs measure with energy function E, which is a pivotal quantity for the simulated
annealing method introduced in Section 5.2.3. In particular, when T goes to 0, the
Gibbs measure converges to the uniform distribution on the set of (global) maxima of
E (see Hwang 1980). This convergence is interesting more because of the connections
it exhibits with the notions of Gibbs measure and of simulated annealing rather than
for its practical consequences. The assumptions (i)-(iii) are rather restrictive and
difficult to check for implicit h's, and the representation (5.25) is rather specialized!
(Note, however, that the completion of $h(\theta)$ in $H(x, \theta)$ is free, since the conditions
(i)-(iii) relate only to h.)

5.5.4 Monte Carlo Approximation 

In cases where a function $h(x)$ can be written as $E[H(x, Z)]$ but is not directly
computable, it can be approximated by the empirical (Monte Carlo) average

$$\hat h(x) = \frac{1}{m} \sum_{i=1}^{m} H(x, z_i) ,$$

where the $z_i$'s are generated from the conditional distribution $f(z|x)$. This
approximation yields a convergent estimator of $h(x)$ for every value of $x$, but its use in
optimization setups is not recommended for at least two related reasons:

(i) Presumably $h(x)$ needs to be evaluated at many points, which will involve the
generation of many samples of $z_i$'s of size $m$.

(ii) Since the sample changes with every value of $x$, the resulting sequence of
evaluations of $h$ will usually not be smooth.

These difficulties prompted Geyer (1996) to suggest, instead, an importance
sampling approach to this problem, using a single sample of $z_i$'s simulated from $g(z)$
and estimating $h(x)$ with

$$\hat h_m(x) = \frac{1}{m} \sum_{i=1}^{m} \frac{f(z_i|x)}{g(z_i)}\, H(x, z_i) . \tag{5.26}$$

Since the sample used in this evaluation of $h$ does not depend on $x$, points (i) and (ii)
above are answered.
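
A minimal Python sketch of the importance sampling approximation described above may help fix ideas. It assumes, purely for illustration, that $f(z|x)$ is the $\mathcal{N}(x, 1)$ density, that $g$ is the $\mathcal{N}(0, 3^2)$ density, and that $H(x, z) = \cos(z)$, so that $h(x) = \cos(x) e^{-1/2}$ is available for comparison; the names `make_h_hat` and `normal_pdf` are hypothetical.

```python
import math
import random

def normal_pdf(z, mean, sd):
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def make_h_hat(m, seed=0):
    """Return x -> (1/m) sum_i f(z_i|x) / g(z_i) * H(x, z_i) for ONE fixed sample from g."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 3.0) for _ in range(m)]     # z_i ~ g = N(0, 3^2), drawn once
    g = [normal_pdf(zi, 0.0, 3.0) for zi in z]

    def h_hat(x):
        total = 0.0
        for zi, gi in zip(z, g):
            total += normal_pdf(zi, x, 1.0) / gi * math.cos(zi)   # H(x, z) = cos(z)
        return total / m

    return h_hat

h_hat = make_h_hat(m=40000)

def exact(x):
    return math.cos(x) * math.exp(-0.5)   # E[cos Z] for Z ~ N(x, 1)

print(h_hat(0.5), exact(0.5))
```

Because the sample is drawn once, `h_hat` is an ordinary deterministic function of `x`: calling it twice at the same point returns the same value, and it varies smoothly with `x`.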

The problem then shifts from (5.1) to

$$\max_x \hat h_m(x) , \tag{5.27}$$

which leads to a convergent solution in most cases and also allows for the use of
regular optimization techniques, since the function $\hat h_m$ does not vary with each
iteration. However, three remaining drawbacks of this approach are as follows:

(i) As $\hat h_m$ is expressed as a sum, it most often enjoys fewer analytical properties
than the original function $h$.

(ii) The choice of the importance function $g$ can be very influential in obtaining a
good approximation of the function $h(x)$.

(iii) The number of points $z_i$ used in the approximation should vary with $x$ to achieve
the same precision in the approximation of $h(x)$, but this is usually impossible
to assess in advance.

In the case $g(z) = f(z|x_0)$, Geyer’s (1996) solution to (ii) is to use a recursive
process in which $x_0$ is updated by the solution of the last optimization at each step.
The Monte Carlo maximization algorithm then looks like the following:

Algorithm A.22 -Monte Carlo Maximization-

At step $i$

1. Generate $z_1, \ldots, z_m \sim f(z|x_i)$
   and compute $\hat h_{g_i}$ with $g_i = f(\cdot|x_i)$.

2. Find $x^* = \arg\max_x \hat h_{g_i}(x)$.

3. Update $x_i$ to $x_{i+1} = x^*$.                                        [A.22]

Repeat until $x_i = x_{i+1}$.
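
The recursion of Algorithm [A.22] can be sketched in Python as follows. This is only a toy version under stated assumptions: $f(z|x)$ is taken to be the $\mathcal{N}(x, 1)$ density, $H(x, z) = -(z - 1)^2$ (so that $h(x) = -((x - 1)^2 + 1)$ is maximal at $x = 1$), and the maximization in step 2 is replaced by a search over a grid centred at the current $x_i$. The grid search, the cap on the number of steps, and the name `mc_maximize` are choices made for the example, not part of the algorithm as stated.

```python
import math
import random

def H(z):
    return -(z - 1.0) ** 2   # h(x) = E[H(Z)] = -((x - 1)^2 + 1) for Z ~ N(x, 1)

def mc_maximize(x0, m=4000, max_steps=15, half_width=1.0, n_grid=41, seed=0):
    rng = random.Random(seed)
    x_i = x0
    for _ in range(max_steps):
        # Step 1: z_1, ..., z_m ~ f(.|x_i) = N(x_i, 1), used as importance sample.
        z = [rng.gauss(x_i, 1.0) for _ in range(m)]
        hz = [H(zi) for zi in z]

        def h_hat(x):
            s = 0.0
            for zi, hi in zip(z, hz):
                # log of the weight f(z|x) / f(z|x_i) for unit-variance normals
                log_w = -0.5 * (zi - x) ** 2 + 0.5 * (zi - x_i) ** 2
                s += math.exp(log_w) * hi
            return s / m

        # Step 2: maximize h_hat over a grid centred at x_i (assumed local search).
        grid = [x_i + half_width * (2.0 * k / (n_grid - 1) - 1.0) for k in range(n_grid)]
        x_star = max(grid, key=h_hat)
        # Step 3 and stopping rule: stop once the maximizer no longer moves.
        if x_star == x_i:
            break
        x_i = x_star
    return x_i

x_hat = mc_maximize(x0=-1.0)
print(x_hat)  # should end up near the maximizer x = 1
```

Restricting the search to a neighborhood of $x_i$ keeps the importance weights $f(z|x)/f(z|x_i)$ from degenerating, which is one practical way of coping with drawback (ii).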





Example 5.25. Maximum likelihood estimation for exponential families.

Geyer and Thompson (1992) take advantage of this technique to derive maximum
likelihood estimators in exponential families; that is, for functions

$$h(x|\theta) = c(\theta)\, \tilde h(x|\theta) .$$

However, $c(\theta)$ may be unknown or difficult to compute. Geyer and Thompson (1992,
1995) establish that maximization of $h(x|\theta)$ is equivalent to maximizing

$$\log \frac{\tilde h(x|\theta)}{\tilde h(x|\eta)} - \log \frac{1}{m} \sum_{i=1}^{m} \frac{\tilde h(x_i|\theta)}{\tilde h(x_i|\eta)} ,$$

where the $X_i$'s are generated from $h(x|\eta)$ (see Problem 5.11, and Geyer 1993, 1994).

This representation also extends to setups where the likelihood $L(\theta|x)$ is known
up to a multiplicative constant (that is, where $L(\theta|x) \propto h(\theta|x)$). ||
Markov Chains



Leaphorn never counted on luck. Instead, he expected order — the natural 
sequence of behavior, the cause producing the natural effect, the human 
behaving in the way it was natural for him to behave. He counted on that 
and on his own ability to sort out the chaos of observed facts and find in 
them this natural order. 

—Tony Hillerman, The Blessing Way 



In this chapter we introduce fundamental notions of Markov chains and state 
the results that are needed to establish the convergence of various MCMC 
algorithms and, more generally, to understand the literature on this topic. 
Thus, this chapter, along with basic notions of probability theory, will pro- 
vide enough foundation for the understanding of the following chapters. It 
is, unfortunately, a necessarily brief and, therefore, incomplete introduction 
to Markov chains, and we refer the reader to Meyn and Tweedie (1993), on 
which this chapter is based, for a thorough introduction to Markov chains. 
Other perspectives can be found in Doob (1953), Chung (1960), Feller (1970, 
1971), and Billingsley (1995) for general treatments, and Norris (1997), Num- 
melin (1984), Revuz (1984), and Resnick (1994) for books entirely dedicated 
to Markov chains. Given the purely utilitarian goal of this chapter, its style 
and presentation differ from those of other chapters, especially with regard 
to the plethora of definitions and theorems and to the rarity of examples and 
proofs. In order to make the book accessible to those who are more interested 
in the implementation aspects of MCMC algorithms than in their theoretical 
foundations, we include a preliminary section that contains the essential facts 
about Markov chains. 

Before formally introducing the notion of a Markov chain, note that we 
do not deal in this chapter with Markov models in continuous time (also 
called Markov processes) since the very nature of simulation leads¹ us to
consider only discrete-time stochastic processes, $(X_n)_{n \in \mathbb{N}}$. Indeed, Hastings
(1970) notes that the use of pseudo-random generators and the representation
of numbers in a computer imply that the Markov chains associated with Markov
chain Monte Carlo methods are, in fact, finite state-space Markov chains.
However, we also consider arbitrary state-space Markov chains to allow for 
continuous support distributions and to avoid addressing the problem of ap- 
proximation of these distributions with discrete support distributions, since 
such an approximation depends on both material and algorithmic specifics 
of a given technique (see Roberts et al. 1995, for a study of the influence of
discretization on the convergence of Markov chains associated with Markov 
chain Monte Carlo algorithms). 



6.1 Essentials for MCMC 

For those familiar with the properties of Markov chains, this first section 
provides a brief survey of the properties of Markov chains that are contained 
in the chapter and are essential for the study of MCMC methods. Starting with 
Section 6.2, the theory of Markov chains is developed from first principles. 

In the setup of MCMC algorithms, Markov chains are constructed from
a transition kernel $K$ (Definition 6.2), a conditional probability density such
that $X_{n+1} \sim K(X_n, X_{n+1})$. A typical example is provided by the random
walk process, formally defined as follows.

Definition 6.1. A sequence of random variables $(X_n)$ is a random walk if it
satisfies

$$X_{n+1} = X_n + \epsilon_n ,$$

where $\epsilon_n$ is generated independently of $X_n, X_{n-1}, \ldots$. If the distribution of
the $\epsilon_n$ is symmetric about zero, the sequence is called a symmetric random
walk.

There are many examples of random walks (see Examples 6.39, 6.40, and 
6.73), and random walks play a key role in many MCMC algorithms, partic- 
ularly those based on the Metropolis-Hastings algorithm (see Chapter 7). 
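
As a small sketch of Definition 6.1, the following Python function simulates a symmetric random walk. The Gaussian increments and the name `random_walk` are assumptions made for the example; any distribution symmetric about zero would do.

```python
import random

def random_walk(n, x0=0.0, scale=1.0, seed=0):
    """Return [X_0, X_1, ..., X_n] with X_{k+1} = X_k + eps_k, eps_k ~ N(0, scale^2)."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n):
        eps = rng.gauss(0.0, scale)   # symmetric about zero, independent of the past
        path.append(path[-1] + eps)
    return path

path = random_walk(1000)
print(len(path), path[0])
```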

The chains encountered in MCMC settings enjoy a very strong stability
property, namely a stationary probability distribution exists by construction
(Definition 6.35); that is, a distribution $\pi$ such that if $X_n \sim \pi$, then $X_{n+1} \sim \pi$,
if the kernel $K$ allows for free moves all over the state space. (This freedom
is called irreducibility in the theory of Markov chains and is formalized in
Definition 6.13 as the existence of $n \in \mathbb{N}$ such that $P(X_n \in A | X_0) > 0$ for
every $A$ such that $\pi(A) > 0$.) This property also ensures that most of the
chains involved in MCMC algorithms are recurrent (that is, that the average
number of visits to an arbitrary set $A$ is infinite (Definition 6.29)), or even
Harris recurrent (that is, such that the probability of an infinite number of
returns to $A$ is 1 (Definition 6.32)). Harris recurrence ensures that the chain
has the same limiting behavior for every starting value instead of almost every
starting value. (Therefore, this is the Markov chain equivalent of the notion
of continuity for functions.)

¹ Some Markov chain Monte Carlo algorithms still employ a diffusion representation
to speed up convergence to the stationary distribution (see, for instance, Section
7.8.5, Roberts and Tweedie 1995, or Phillips and Smith 1996).

This latter point is quite important in the context of MCMC algorithms.
Since most algorithms are started from some arbitrary point $x_0$, we are in
effect starting the algorithm from a set of measure zero (under a continuous
dominating measure). Thus, ensuring that the chain converges for almost every
starting point is not enough, and we need Harris recurrence to guarantee
convergence from every starting point.

The stationary distribution is also a limiting distribution in the sense that
the limiting distribution of $X_{n+1}$ is $\pi$ under the total variation norm (see
Proposition 6.48), notwithstanding the initial value of $X_0$. Stronger forms
of convergence are also encountered in MCMC settings, like geometric and
uniform convergences (see Definitions 6.54 and 6.58). In a simulation setup, a
most interesting consequence of this convergence property is that the average

$$\frac{1}{N} \sum_{n=1}^{N} h(X_n) \tag{6.1}$$

converges to the expectation $E_\pi[h(X)]$ almost surely. When the chain is
reversible (Definition 6.44) (that is, when the transition kernel is symmetric), a
Central Limit Theorem also holds for this average.
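
The convergence of the average (6.1) can be illustrated on a toy chain. The sketch below assumes a two-state chain on $\{0, 1\}$ with transition probabilities $0.3$ (from 0 to 1) and $0.1$ (from 1 to 0), whose stationary distribution is $\pi = (0.25, 0.75)$; the chain, the values, and the name `ergodic_average` are illustrative choices, not taken from the text.

```python
import random

def ergodic_average(h, n, p01=0.3, p10=0.1, seed=0):
    """Average h(X_1), ..., h(X_n) along a two-state chain started at X_0 = 0."""
    rng = random.Random(seed)
    x, total = 0, 0.0
    for _ in range(n):
        u = rng.random()
        if x == 0:
            x = 1 if u < p01 else 0
        else:
            x = 0 if u < p10 else 1
        total += h(x)
    return total / n

# Stationary distribution: pi = (p10, p01) / (p01 + p10) = (0.25, 0.75).
avg = ergodic_average(lambda x: float(x == 1), n=200000)
print(avg)  # should be close to pi(1) = 0.75
```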

In Chapter 12, diagnostics will be based on a minorization condition; that
is, the existence of a set $C$ such that there also exist $m \in \mathbb{N}$, $\epsilon_m > 0$, and a
probability measure $\nu_m$ such that

$$P(X_m \in A | X_0) \ge \epsilon_m \nu_m(A)$$

when $X_0 \in C$. The set $C$ is then called a small set (Definition 6.19) and
visits of the chain to this set can be exploited to create independent batches
in the sum (6.1), since, with probability $\epsilon_m$, the next value of the $m$-skeleton
Markov chain $(X_{mn})_n$ is generated from the minorizing measure $\nu_m$, which is
independent of $X_0$.

As a final essential, it is sometimes helpful to associate the probabilistic 
language of Markov chains with the statistical language of data analysis. 

Statistics                 Markov Chain

marginal distribution      invariant distribution
proper marginals           positive recurrent

Thus, if the marginals are proper, for convergence we only need our chain to
be aperiodic. This is easy to satisfy; a sufficient condition is that $K(x_n, \cdot) > 0$
(or, equivalently, $f(\cdot|x_n) > 0$) in a neighborhood of $x_n$. If the marginals are
not proper, or if they do not exist, then the chain is not positive recurrent. It
is either null recurrent or transient, and both cases are bad.



6.2 Basic Notions 

A Markov chain is a sequence of random variables that can be thought of as 
evolving over time, with probability of a transition depending on the particular 
set in which the chain is. It therefore seems natural and, in fact, is mathematically
somewhat cleaner to define the chain in terms of its transition kernel,
the function that determines these transitions.

Definition 6.2. A transition kernel is a function $K$ defined on $\mathcal{X} \times \mathcal{B}(\mathcal{X})$ such
that

(i) $\forall x \in \mathcal{X}$, $K(x, \cdot)$ is a probability measure;

(ii) $\forall A \in \mathcal{B}(\mathcal{X})$, $K(\cdot, A)$ is measurable.

When $\mathcal{X}$ is discrete, the transition kernel simply is a (transition) matrix
$K$ with elements

$$P_{xy} = P(X_n = y | X_{n-1} = x) , \qquad x, y \in \mathcal{X} .$$

In the continuous case, the kernel also denotes the conditional density $K(x, x')$
of the transition $K(x, \cdot)$; that is, $P(X \in A | x) = \int_A K(x, x')\, dx'$.



Example 6.3. Bernoulli-Laplace Model. Consider $\mathcal{X} = \{0, 1, \ldots, M\}$
and a chain $(X_n)$ such that $X_n$ represents the state, at time $n$, of a tank which
contains exactly $M$ particles and is connected to another identical tank. Two
types of particles are introduced in the system, and there are $M$ of each type.
If $X_n$ denotes the number of particles of the first kind in the first tank at time
$n$ and the moves are restricted to a single exchange of particles between the
two tanks at each instant, the transition matrix is given by (for $0 < x, y < M$)
$P_{xy} = 0$ if $|x - y| > 1$,

$$P_{xx} = 2\, \frac{x(M - x)}{M^2} , \qquad P_{x(x-1)} = \left(\frac{x}{M}\right)^2 , \qquad P_{x(x+1)} = \left(\frac{M - x}{M}\right)^2 ,$$

and $P_{01} = P_{M(M-1)} = 1$. (This model is the Bernoulli-Laplace model; see
Feller 1970, Chapter XV.) ||
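
The transition matrix of the Bernoulli-Laplace model is easy to build explicitly. The Python sketch below uses exact fractions and the transition probabilities $2x(M - x)/M^2$, $(x/M)^2$, and $((M - x)/M)^2$ obtained by drawing one particle from each tank and exchanging them; the function name `bernoulli_laplace_matrix` is hypothetical.

```python
from fractions import Fraction

def bernoulli_laplace_matrix(M):
    """Transition matrix on {0, ..., M}; state = number of type-1 particles in tank 1."""
    P = [[Fraction(0)] * (M + 1) for _ in range(M + 1)]
    for x in range(M + 1):
        P[x][x] = Fraction(2 * x * (M - x), M * M)           # the exchange swaps like for like
        if x > 0:
            P[x][x - 1] = Fraction(x * x, M * M)             # type 1 leaves, type 2 enters
        if x < M:
            P[x][x + 1] = Fraction((M - x) ** 2, M * M)      # type 2 leaves, type 1 enters
    return P

P = bernoulli_laplace_matrix(4)
print([sum(row) for row in P])   # every row sums to 1
```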



The chain $(X_n)$ is usually defined for $n \in \mathbb{N}$ rather than for $n \in \mathbb{Z}$. Therefore,
the distribution of $X_0$, the initial state of the chain, plays an important
role. In the discrete case, where the kernel $K$ is a transition matrix, given an
initial distribution $\mu = (\omega_1, \omega_2, \ldots)$, the marginal probability distribution of
$X_1$ is obtained from the matrix multiplication

$$\mu_1 = \mu K \tag{6.2}$$

and, by repeated multiplication, $X_n \sim \mu_n = \mu K^n$. Similarly, in the continuous
case, if $\mu$ denotes the initial distribution of the chain, namely if

$$X_0 \sim \mu , \tag{6.3}$$

then we let $P_\mu$ denote the probability distribution of $(X_n)$ under condition
(6.3). When $X_0$ is fixed, in particular for $\mu$ equal to the Dirac mass $\delta_{x_0}$, we
use the alternative notation $P_{x_0}$.
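
In the discrete case, the relation $\mu_n = \mu K^n$ amounts to repeated vector-matrix products. The following Python sketch illustrates this on an assumed two-state transition matrix (the matrix and the helper names `vec_mat` and `marginal` are chosen for the example).

```python
def vec_mat(mu, K):
    """Row vector times matrix: (mu K)_j = sum_i mu_i K_ij."""
    n = len(K)
    return [sum(mu[i] * K[i][j] for i in range(n)) for j in range(n)]

def marginal(mu, K, n):
    """Distribution of X_n when X_0 ~ mu, computed as mu K^n."""
    for _ in range(n):
        mu = vec_mat(mu, K)
    return mu

K = [[0.7, 0.3],
     [0.1, 0.9]]
mu = [1.0, 0.0]               # Dirac mass at the first state
mu1 = marginal(mu, K, 1)      # distribution of X_1: the first row of K
mu50 = marginal(mu, K, 50)    # approaches the stationary distribution (0.25, 0.75)
print(mu1, mu50)
```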

Definition 6.4. Given a transition kernel $K$, a sequence $X_0, X_1, \ldots, X_n, \ldots$
of random variables is a Markov chain, denoted by $(X_n)$, if, for any $t$, the
conditional distribution of $X_t$ given $x_{t-1}, x_{t-2}, \ldots, x_0$ is the same as the
distribution of $X_t$ given $x_{t-1}$; that is,

$$P(X_{k+1} \in A | x_0, x_1, x_2, \ldots, x_k) = P(X_{k+1} \in A | x_k) = \int_A K(x_k, dx) . \tag{6.4}$$

The chain is time homogeneous, or simply homogeneous, if the distribution
of $(X_{t_1}, \ldots, X_{t_k})$ given $x_{t_0}$ is the same as the distribution of
$(X_{t_1 - t_0}, X_{t_2 - t_0}, \ldots, X_{t_k - t_0})$ given $x_0$ for every $k$ and every $(k+1)$-tuple
$t_0 \le t_1 \le \cdots \le t_k$.

So, in the case of a Markov chain, if the initial distribution or the initial
state is known, the construction of the Markov chain $(X_n)$ is entirely determined
by its transition, namely by the distribution of $X_n$ conditionally on
$x_{n-1}$.

The study of Markov chains is almost always restricted to the time- 
homogeneous case and we omit this designation in the following. It is, however, 
important to note here that an incorrect implementation of Markov chain 
Monte Carlo algorithms can easily produce nonhomogeneous Markov chains 
for which the standard convergence properties do not apply. (See also the case 
of the ARMS algorithm in Section 7.4.2.) 

Example 6.5. Simulated Annealing. The simulated annealing algorithm
(see Section 5.2.3 for details) is often implemented in a nonhomogeneous form
and studied in time-homogeneous form. Given a finite state-space with size
$K$, $\Omega = \{1, 2, \ldots, K\}$, an energy function $E(\cdot)$, and a temperature $T$, the
simulated annealing Markov chain $X_0, X_1, \ldots$ is represented by the following
transition operator: Conditionally on $X_n$, $Y$ is generated from a fixed probability
distribution $(\pi_1, \ldots, \pi_K)$ on $\Omega$ and the new value of the chain is given
by

$$X_{n+1} = \begin{cases} Y & \text{with probability } \exp\{(E(Y) - E(X_n))/T\} \wedge 1 , \\ X_n & \text{otherwise.} \end{cases}$$

If the temperature $T$ depends on $n$, the chain is time heterogeneous. ||
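
The transition operator of this example can be sketched in a few lines of Python. The energy function, the uniform proposal, the logarithmic temperature schedule, and the names `accept_prob` and `annealing_chain` are assumptions made for illustration; passing a constant schedule gives the time-homogeneous version.

```python
import math
import random

def accept_prob(e_current, e_proposal, T):
    """exp{(E(Y) - E(X_n)) / T} ^ 1, written to avoid overflow for large positive gaps."""
    diff = (e_proposal - e_current) / T
    return 1.0 if diff >= 0 else math.exp(diff)

def annealing_chain(E, proposal, n, temperature, x0, seed=0):
    """Simulate X_0, ..., X_n on Omega = {1, ..., K} with K = len(proposal)."""
    rng = random.Random(seed)
    states = list(range(1, len(proposal) + 1))
    x, path = x0, [x0]
    for step in range(1, n + 1):
        y = rng.choices(states, weights=proposal)[0]      # Y ~ (pi_1, ..., pi_K)
        if rng.random() < accept_prob(E(x), E(y), temperature(step)):
            x = y                                         # accept Y
        path.append(x)                                    # otherwise X_{n+1} = X_n
    return path

def energy(k):
    return -(k - 3) ** 2      # toy energy, maximal at k = 3

# Temperature depending on n: a time-heterogeneous chain.
path = annealing_chain(energy, [0.2] * 5, 2000, lambda n: 1.0 / math.log(1 + n), x0=1)
print(path[-10:])
```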

Example 6.6. AR(1) Models. AR(1) models provide a simple illustration
of Markov chains on continuous state-space. If

$$X_n = \theta X_{n-1} + \epsilon_n , \qquad \theta \in \mathbb{R} ,$$

with $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$, and if the $\epsilon_n$'s are independent, $X_n$ is indeed independent
from $X_{n-2}, X_{n-3}, \ldots$ conditionally on $X_{n-1}$. The Markovian properties of an
AR($q$) process can be derived by considering the vector $(X_n, \ldots, X_{n-q+1})$. On
the other hand, ARMA($p$, $q$) models do not fit in the Markovian framework
(see Problem 6.3). ||
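
A short Python sketch of the AR(1) chain follows. The parameter values and the name `ar1_path` are illustrative; the comparison value $\sigma^2/(1 - \theta^2)$ is the standard stationary variance for $|\theta| < 1$, a fact not stated in the example itself.

```python
import random

def ar1_path(theta, sigma, n, x0=0.0, seed=0):
    """Simulate X_1, ..., X_n with X_k = theta * X_{k-1} + eps_k, eps_k ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    x, path = x0, []
    for _ in range(n):
        x = theta * x + rng.gauss(0.0, sigma)   # depends on the past only through x
        path.append(x)
    return path

path = ar1_path(theta=0.5, sigma=1.0, n=50000)
mean = sum(path) / len(path)
var = sum((v - mean) ** 2 for v in path) / len(path)
print(var)   # should be near sigma^2 / (1 - theta^2) = 4/3
```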



In the general case, the fact that the kernel $K$ determines the properties
of the chain $(X_n)$ can be inferred from the relations

$$\begin{aligned}
P_x(X_1 \in A_1) &= K(x, A_1) , \\
P_x((X_1, X_2) \in A_1 \times A_2) &= \int_{A_1} K(y_1, A_2)\, K(x, dy_1) , \\
P_x((X_1, \ldots, X_n) \in A_1 \times \cdots \times A_n) &= \int_{A_1} \cdots \int_{A_{n-1}} K(y_{n-1}, A_n)\, K(x, dy_1) \cdots K(y_{n-2}, dy_{n-1}) .
\end{aligned}$$



In particular, the relation $P_x(X_1 \in A_1) = K(x, A_1)$ indicates that
$K(x_n, dx_{n+1})$ is a version of the conditional distribution of $X_{n+1}$ given $x_n$.
However, as we have defined a Markov chain by first specifying this kernel, we do
not need to be concerned with different versions of the conditional probabil- 
ities. This is why we noted that constructing the Markov chain through the 
transition kernel was mathematically “cleaner.” (Moreover, in the following 
chapters, we will see that the objects of interest are often these conditional 
distributions, and it is important that we need not worry about different ver- 
sions. Nonetheless, the properties of a Markov chain considered in this chapter 
are independent of the version of the conditional probability chosen.) 

If we denote $K^1(x, A) = K(x, A)$, the kernel for $n$ transitions is given by
($n > 1$)

$$K^n(x, A) = \int_{\mathcal{X}} K^{n-1}(y, A)\, K(x, dy) . \tag{6.5}$$

The following result provides convolution formulas of the type $K^{m+n} = K^m \star K^n$,
which are called Chapman-Kolmogorov equations.
Lemma 6.7. Chapman-Kolmogorov equations. For every $(m, n) \in \mathbb{N}^2$,
$x \in \mathcal{X}$, $A \in \mathcal{B}(\mathcal{X})$,

$$K^{m+n}(x, A) = \int_{\mathcal{X}} K^m(y, A)\, K^n(x, dy) .$$

(In a very informal sense, the Chapman-Kolmogorov equations state that
to get from $x$ to $A$ in $m + n$ steps, you must pass through some $y$ on the $n$th
step.) In the discrete case, Lemma 6.7 is simply interpreted as a matrix product
and follows directly from (6.2). In the general case, we need to consider
$K$ as an operator on the space of integrable functions; that is, we define

$$Kh(x) = \int h(y)\, K(x, dy) , \qquad h \in \mathcal{L}_1(\lambda) ,$$

$\lambda$ being the dominating measure of the model. $K^n$ is then the $n$th composition
of $K$, namely $K^n = K \circ K^{n-1}$.
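
In the discrete case, the Chapman-Kolmogorov equations reduce to the identity $K^{m+n} = K^n K^m$ between matrix powers, which can be checked numerically. The three-state matrix and the helper names `mat_mul` and `mat_pow` below are assumptions made for the example.

```python
def mat_mul(A, B):
    """Plain matrix product of two lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)] for row in A]

def mat_pow(K, n):
    """n-step transition matrix K^n (K^0 is the identity)."""
    size = len(K)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, K)
    return result

K = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
m, n = 2, 3
lhs = mat_pow(K, m + n)
rhs = mat_mul(mat_pow(K, n), mat_pow(K, m))   # sum over the state y occupied after n steps
print(lhs[0], rhs[0])
```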

Definition 6.8. A resolvent associated with the kernel $K$ is a kernel of the
form

$$K_\epsilon(x, A) = (1 - \epsilon) \sum_{i=0}^{\infty} \epsilon^i K^i(x, A) , \qquad 0 < \epsilon < 1 ,$$

and the chain with kernel $K_\epsilon$ is a $K_\epsilon$-chain.

Given an initial distribution $\mu$, we can associate with the kernel $K_\epsilon$ a chain
$(X_n^\epsilon)$ which formally corresponds to a subchain of the original chain $(X_n)$,
where the indices in the subchain are generated from a geometric distribution
with parameter $1 - \epsilon$. Thus, $K_\epsilon$ is indeed a kernel, and we will see that the
resulting Markov chain $(X_n^\epsilon)$ enjoys much stronger regularity. This will be
used later to establish many properties of the original chain.
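
For a finite chain, the kernel $K_\epsilon$ can be approximated by truncating the series. The Python sketch below does this for an assumed two-state periodic chain, for which the exact answer with $\epsilon = 1/2$ has diagonal entries $1/(1 + \epsilon) = 2/3$ and off-diagonal entries $\epsilon/(1 + \epsilon) = 1/3$; the truncation length and the name `resolvent` are choices made for the example.

```python
def mat_mul(A, B):
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)] for row in A]

def resolvent(K, eps, n_terms=200):
    """Truncated series (1 - eps) * sum_{i < n_terms} eps^i K^i."""
    size = len(K)
    power = [[float(i == j) for j in range(size)] for i in range(size)]   # K^0
    total = [[0.0] * size for _ in range(size)]
    for i in range(n_terms):
        weight = (1.0 - eps) * eps ** i
        for r in range(size):
            for c in range(size):
                total[r][c] += weight * power[r][c]
        power = mat_mul(power, K)
    return total

K = [[0.0, 1.0],
     [1.0, 0.0]]              # deterministic alternation between the two states
K_half = resolvent(K, 0.5)
print(K_half)                 # all entries are strictly positive
```

Even though $K$ itself has zero entries, every entry of the resolvent is positive, which illustrates the stronger regularity of the $K_\epsilon$-chain.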

If $E_\mu[\,\cdot\,]$ denotes the expectation associated with the distribution $P_\mu$, the
(weak) Markov property can be written as the following result, which just
rephrases the limited memory properties of a Markov chain:

Proposition 6.9. Weak Markov property. For every initial distribution
$\mu$ and every $(n + 1)$ sample $(X_0, \ldots, X_n)$,

$$E_\mu[h(X_{n+1}, X_{n+2}, \ldots) | x_0, \ldots, x_n] = E_{x_n}[h(X_1, X_2, \ldots)] , \tag{6.6}$$

provided that the expectations exist.

Note that if h is the indicator function, then this definition is exactly the 
same as Definition 6.4. However, (6.6) can be generalized to other classes of 
functions — hence the terminology “weak” — and it becomes particularly useful 
with the notion of stopping time in the convergence assessment of Markov 
chain Monte Carlo algorithms in Chapter 12. 
Definition 6.10. Consider $A \in \mathcal{B}(\mathcal{X})$. The first $n$ for which the chain enters
the set $A$ is denoted by

$$\tau_A = \inf\{n \ge 1; X_n \in A\} \tag{6.7}$$

and is called the stopping time at $A$ with, by convention, $\tau_A = +\infty$ if $X_n \notin A$
for every $n$. More generally, a function $\zeta(x_1, x_2, \ldots)$ is called a stopping rule
if the set $\{\zeta = n\}$ is measurable for the $\sigma$-algebra induced by $(X_0, \ldots, X_n)$.
Associated with the set $A$, we also define

$$\eta_A = \sum_{n=1}^{\infty} \mathbb{I}_A(X_n) , \tag{6.8}$$

the number of passages of $(X_n)$ in $A$.

Of particular importance are the related quantities $E_x[\eta_A]$ and $P_x(\tau_A < \infty)$,
which are the average number of passages in $A$ and the probability of
return to $A$ in a finite number of steps.

We will be mostly concerned with stopping rules of the form given in (6.7),
which express the fact that $\tau_A$ takes the value $n$ when none of the values of
$X_0, X_1, \ldots, X_{n-1}$ are in the given state (or set) $A$, but the $n$th value is.
The strong Markov property corresponds to the following result, whose proof
follows from the weak Markov property and conditioning on $\{\zeta = n\}$:

Proposition 6.11. Strong Markov property. For every initial distribution
$\mu$ and every stopping time $\zeta$, which is almost surely finite,

$$E_\mu[h(X_{\zeta+1}, X_{\zeta+2}, \ldots) | x_1, \ldots, x_\zeta] = E_{x_\zeta}[h(X_1, X_2, \ldots)] ,$$

provided the expectations exist.

We can thus condition on a random number of instants while keeping the 
fundamental properties of a Markov chain. 

Example 6.12. Coin tossing. In a coin tossing game, player b has a gain
of +1 if a head appears and player c has a gain of +1 if a tail appears (so
player b has a “gain” of -1 (a loss) if a tail appears). If $X_n$ is the sum of
the gains of player b after $n$ rounds of this coin tossing game, the transition
matrix $P$ is an infinite dimensional matrix with upper and lower subdiagonals
equal to 1/2. Assume that player b has $B$ dollars and player c has $C$ dollars,
and consider the following return times:

$$\tau_1 = \inf\{n; X_n = 0\}, \qquad \tau_2 = \inf\{n; X_n \le -B\}, \qquad \tau_3 = \inf\{n; X_n \ge C\},$$

which represent respectively the return to null and the ruins of the first and
second players, that is to say, the first times the fortunes of both players,
respectively $B$ and $C$, are spent. The probability of ruin (bankruptcy) for
the first player is then $P_0(\tau_2 < \tau_3)$. (Feller 1970, Chapter III, has a detailed
analysis of this coin tossing game.) ||
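
The ruin probability of this example can be estimated by simulating the stopping times directly. In the Python sketch below, a game stops as soon as the cumulated gain of player b reaches $-B$ (ruin of player b) or $C$ (ruin of player c). The fair coin, the values of $B$ and $C$, and the function names are assumptions; the comparison value $C/(B + C)$ is the classical gambler's ruin result for a fair coin, which is not stated in the example itself.

```python
import random

def first_player_ruined(B, C, rng):
    """Play until X_n hits -B or C; return True if -B is reached first."""
    x = 0
    while -B < x < C:
        x += 1 if rng.random() < 0.5 else -1   # head: +1 for player b, tail: -1
    return x <= -B

def ruin_probability(B, C, n_games=20000, seed=0):
    rng = random.Random(seed)
    return sum(first_player_ruined(B, C, rng) for _ in range(n_games)) / n_games

p_hat = ruin_probability(B=3, C=5)
print(p_hat)   # should be close to C / (B + C) = 0.625
```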
6.3 Irreducibility, Atoms, and Small Sets



6.3.1 Irreducibility 



The property of irreducibility is a first measure of the sensitivity of the Markov
chain to the initial conditions, $x_0$ or $\mu$. It is crucial in the setup of Markov chain
Monte Carlo algorithms, because it leads to a guarantee of convergence, thus
avoiding a detailed study of the transition operator, which would otherwise
be necessary to specify “acceptable” initial conditions.

In the discrete case, the chain is irreducible if all states communicate,
namely if

$$P_x(\tau_y < \infty) > 0 , \qquad \forall x, y \in \mathcal{X} ,$$

$\tau_y$ being the first time $y$ is visited, defined in (6.7). In many cases, $P_x(\tau_y < \infty)$
is uniformly equal to zero, and it is necessary to introduce an auxiliary measure
$\varphi$ on $\mathcal{B}(\mathcal{X})$ to correctly define the notion of irreducibility.

Definition 6.13. Given a measure $\varphi$, the Markov chain $(X_n)$ with transition
kernel $K(x, y)$ is $\varphi$-irreducible if, for every $A \in \mathcal{B}(\mathcal{X})$ with $\varphi(A) > 0$, there
exists $n$ such that $K^n(x, A) > 0$ for all $x \in \mathcal{X}$ (equivalently, $P_x(\tau_A < \infty) > 0$).
The chain is strongly $\varphi$-irreducible if $n = 1$ for all measurable $A$.
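
For a finite state space, irreducibility in the sense that all states communicate can be checked by a graph search on the positive entries of the transition matrix. The following Python sketch is one way to do it; the function name `is_irreducible` and the two test matrices are assumptions made for the example.

```python
def is_irreducible(P):
    """True if every state can reach every other state through positive transitions."""
    n = len(P)
    for start in range(n):
        seen, stack = {start}, [start]
        while stack:                      # depth-first search from `start`
            x = stack.pop()
            for y in range(n):
                if P[x][y] > 0 and y not in seen:
                    seen.add(y)
                    stack.append(y)
        if len(seen) < n:
            return False
    return True

mixing = [[0.5, 0.5],
          [0.5, 0.5]]
absorbing = [[1.0, 0.0],      # the first state is never left
             [0.5, 0.5]]
print(is_irreducible(mixing), is_irreducible(absorbing))
```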

Example 6.14. (Continuation of Example 6.3) In the case of the
Bernoulli-Laplace model, the (finite) chain is indeed irreducible since it is
possible to connect the states $x$ and $y$ in $|x - y|$ steps with probability

$$\prod_{i = x \wedge y}^{x \vee y - 1} \left(\frac{M - i}{M}\right)^2 .$$

The following result provides equivalent definitions of irreducibility. The 
proof is left to Problem 6.13, and follows from (6.9) and the Chapman- 
Kolmogorov equations. 

Theorem 6.15. The chain $(X_n)$ is $\varphi$-irreducible if and only if for every $x \in \mathcal{X}$
and every $A \in \mathcal{B}(\mathcal{X})$ such that $\varphi(A) > 0$, one of the following properties holds:

(i) there exists $n \in \mathbb{N}^*$ such that $K^n(x, A) > 0$;

(ii) $E_x[\eta_A] > 0$;

(iii) $K_\epsilon(x, A) > 0$ for an $0 < \epsilon < 1$.

The introduction of the $K_\epsilon$-chain then allows for the creation of a strictly
positive kernel in the case of a $\varphi$-irreducible chain and this property is used
in the following to simplify the proofs. Moreover, the measure $\varphi$ in Definition
6.13 plays no crucial role in the sense that irreducibility is an intrinsic property
of $(X_n)$ and does not rely on $\varphi$.

The following theorem details the properties of the maximal irreducibility
measure $\psi$.
Theorem 6.16. If $(X_n)$ is $\varphi$-irreducible, there exists a probability measure $\psi$
such that:

(i) the Markov chain $(X_n)$ is $\psi$-irreducible;

(ii) if there exists a measure $\xi$ such that $(X_n)$ is $\xi$-irreducible, then $\xi$ is
dominated by $\psi$; that is, $\xi \ll \psi$;

(iii) if $\psi(A) = 0$, then $\psi(\{y; P_y(\tau_A < \infty) > 0\}) = 0$;

(iv) the measure $\psi$ is equivalent to

$$\psi_0(A) = \int_{\mathcal{X}} K_{1/2}(x, A)\, \varphi(dx) , \qquad \forall A \in \mathcal{B}(\mathcal{X}) ; \tag{6.9}$$

that is, $\psi \ll \psi_0$ and $\psi_0 \ll \psi$.

This result provides a constructive method of determining the maximal
irreducibility measure $\psi$ through a candidate measure $\varphi$, which still needs to
be defined.

Example 6.17. (Continuation of Example 6.6) When $X_{n+1} = \theta X_n + \epsilon_{n+1}$
and $\epsilon_n$ are independent normal variables, the chain is irreducible, the
reference measure being the Lebesgue measure, $\lambda$. (In fact, $K(x, A) > 0$ for
every $x \in \mathbb{R}$ and every $A$ such that $\lambda(A) > 0$.) On the other hand, if $\epsilon_n$
is uniform on $[-1, 1]$ and $|\theta| > 1$, the chain is not irreducible anymore. For
instance, if $\theta > 1$, then

$$X_{n+1} - X_n \ge (\theta - 1) X_n - 1 \ge 0$$

for $X_n \ge 1/(\theta - 1)$. The chain is thus monotonically increasing and obviously
cannot visit previous values. ||
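
The loss of irreducibility with uniform noise is easy to observe numerically. The Python sketch below simulates the chain with $\theta = 2$ from a starting value above $1/(\theta - 1) = 1$ and checks that the path only goes up; the parameter values and the name `ar1_uniform` are assumptions made for the example.

```python
import random

def ar1_uniform(theta, x0, n, seed=0):
    """Simulate X_{k+1} = theta * X_k + eps_{k+1} with eps uniform on [-1, 1]."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n):
        path.append(theta * path[-1] + rng.uniform(-1.0, 1.0))
    return path

path = ar1_uniform(theta=2.0, x0=1.5, n=30)   # x0 > 1 / (theta - 1) = 1
increasing = all(b > a for a, b in zip(path, path[1:]))
print(increasing)   # True: the chain never returns to previously visited values
```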



6.3.2 Atoms and Small Sets 

In the discrete case, the transition kernel is necessarily atomic in the usual 
sense; that is, there exist points in the state-space with positive mass. The 
extension of this notion to the general case by Nummelin (1978) is powerful 
enough to allow for a control of the chain which is almost as “precise” as in 
the discrete case. 

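Definition 6.18. The Markov chain $(X_n)$ has an atom $\alpha \in \mathcal{B}(\mathcal{X})$ if there
exists an associated nonzero measure $\nu$ such that

$$K(x, A) = \nu(A) , \qquad \forall x \in \alpha , \ \forall A \in \mathcal{B}(\mathcal{X}) .$$

If $(X_n)$ is $\psi$-irreducible, the atom is accessible when $\psi(\alpha) > 0$.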
While it trivially applies to every possible value of $X_n$ in the discrete case,
this notion is often too strong to be of use in the continuous case since it
implies that the transition kernel is constant on a set of positive measure.
A more powerful generalization is the so-called minorizing condition, namely
that there exists a set $C \in \mathcal{B}(\mathcal{X})$, a constant $\epsilon > 0$, and a probability measure
$\nu$ such that

$$K(x, A) \ge \epsilon \nu(A) , \qquad \forall x \in C , \ \forall A \in \mathcal{B}(\mathcal{X}) . \tag{6.10}$$

The probability measure $\nu$ thus appears as a constant component of the transition
kernel on $C$. The minorizing condition (6.10) leads to the following
notion, which is essential in this chapter and in Chapters 7 and 12 as a technique
of proof and as the basis of renewal theory.

Definition 6.19. A set $C$ is small if there exist $m \in \mathbb{N}^*$ and a nonzero measure
$\nu_m$ such that

$$K^m(x, A) \ge \nu_m(A) , \qquad \forall x \in C , \ \forall A \in \mathcal{B}(\mathcal{X}) . \tag{6.11}$$

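Example 6.20. (Continuation of Example 6.17) Since $X_n | x_{n-1} \sim
\mathcal{N}(\theta x_{n-1}, \sigma^2)$, the transition kernel is bounded from below (for $\theta > 0$) by

$$\frac{1}{\sigma\sqrt{2\pi}} \exp\left\{\left(-x_n^2 + 2\theta x_n \underline{w} - \theta^2 (\underline{w}^2 \vee \bar{w}^2)\right)/2\sigma^2\right\} \quad \text{if } x_n > 0 ,$$

$$\frac{1}{\sigma\sqrt{2\pi}} \exp\left\{\left(-x_n^2 + 2\theta x_n \bar{w} - \theta^2 (\underline{w}^2 \vee \bar{w}^2)\right)/2\sigma^2\right\} \quad \text{if } x_n < 0 ,$$

when $x_{n-1} \in [\underline{w}, \bar{w}]$. The set $C = [\underline{w}, \bar{w}]$ is indeed a small set, as the measure
$\nu_1$, with density

$$\frac{\exp\{(-x^2 + 2\theta x \underline{w})/2\sigma^2\}\, \mathbb{I}_{x>0} + \exp\{(-x^2 + 2\theta x \bar{w})/2\sigma^2\}\, \mathbb{I}_{x<0}}{\sigma\sqrt{2\pi}\, \left\{[1 - \Phi(-\theta\underline{w}/\sigma)] \exp\{\theta^2 \underline{w}^2/2\sigma^2\} + \Phi(-\theta\bar{w}/\sigma) \exp\{\theta^2 \bar{w}^2/2\sigma^2\}\right\}} ,$$

and

$$\epsilon = \exp\{-\theta^2 (\underline{w}^2 \vee \bar{w}^2)/2\sigma^2\} \left\{[1 - \Phi(-\theta\underline{w}/\sigma)] \exp\{\theta^2 \underline{w}^2/2\sigma^2\} + \Phi(-\theta\bar{w}/\sigma) \exp\{\theta^2 \bar{w}^2/2\sigma^2\}\right\} ,$$

satisfy (6.11) with $m = 1$. ||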
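
A numerical check of a minorization of this type is sketched below in Python. It assumes $\theta > 0$ and a lower bound obtained by replacing $x_{n-1}$ with the lower endpoint of $C$ when $x_n > 0$, with the upper endpoint when $x_n < 0$, and $x_{n-1}^2$ with the larger of the two squared endpoints. The parameter values, the grids, and all function names are assumptions made for the example.

```python
import math

def Phi(t):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def kernel(x_prev, x, theta, sigma):
    """Density of N(theta * x_prev, sigma^2) at x."""
    return math.exp(-0.5 * ((x - theta * x_prev) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def lower_bound(x, theta, sigma, w_lo, w_hi):
    """Candidate minorizing density eps * nu_1(x) on C = [w_lo, w_hi], for theta > 0."""
    w = w_lo if x > 0 else w_hi
    w2 = max(w_lo ** 2, w_hi ** 2)
    expo = (-x * x + 2.0 * theta * x * w - theta ** 2 * w2) / (2.0 * sigma ** 2)
    return math.exp(expo) / (sigma * math.sqrt(2.0 * math.pi))

def epsilon(theta, sigma, w_lo, w_hi):
    """Total mass of the lower bound, in closed form."""
    s2 = 2.0 * sigma ** 2
    mass = ((1.0 - Phi(-theta * w_lo / sigma)) * math.exp(theta ** 2 * w_lo ** 2 / s2)
            + Phi(-theta * w_hi / sigma) * math.exp(theta ** 2 * w_hi ** 2 / s2))
    return math.exp(-theta ** 2 * max(w_lo ** 2, w_hi ** 2) / s2) * mass

theta, sigma, w_lo, w_hi = 0.5, 1.0, -1.0, 2.0
xs = [-12.0 + 0.001 * k for k in range(24001)]
prevs = [w_lo + (w_hi - w_lo) * k / 30 for k in range(31)]
# The kernel dominates the bound for every starting point in C (small float slack).
minorized = all(kernel(p, x, theta, sigma) >= lower_bound(x, theta, sigma, w_lo, w_hi) - 1e-15
                for p in prevs for x in xs[::100])
# A Riemann sum of the bound should match the closed-form constant epsilon.
riemann = sum(lower_bound(x, theta, sigma, w_lo, w_hi) for x in xs) * 0.001
print(minorized, riemann, epsilon(theta, sigma, w_lo, w_hi))
```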

A sufficient condition for $C$ to be small is that (6.11) is satisfied by the
chain in the special case $m = 1$. The following result indicates the connection
between small sets and irreducibility.

Theorem 6.21. Let $(X_n)$ be a $\psi$-irreducible chain. For every set $A \in \mathcal{B}(\mathcal{X})$
such that $\psi(A) > 0$, there exist $m \in \mathbb{N}^*$ and a small set $C \subset A$ such that
the associated minorizing measure satisfies $\nu_m(C) > 0$. Moreover, $\mathcal{X}$ can be
decomposed in a denumerable partition of small sets.
The proof of this characterization result is rather involved (see Meyn and
Tweedie 1993, pp. 107-109). The decomposition of $\mathcal{X}$ as a denumerable union
of small sets is based on an arbitrary small set $C$ and the sequence

$$C_{nm} = \{y; K^n(y, C) > 1/m\}$$

(see Problem 6.19).

Small sets are obviously easier to exhibit than atoms, given the freedom 
allowed by the minorizing condition (6.11). Moreover, they are, in fact, very 
common since, in addition to Theorem 6.21, Meyn and Tweedie (1993, p. 
134) show that for sufficiently regular (in a topological sense) Markov chains, 
every compact set is small. Atoms, although a special case of small sets, enjoy
stronger stability properties since the transition probability is invariant on $\alpha$.
However, splitting methods (see below) offer the possibility of extending most
of these properties to the general case and they will be used as a technique of
proof in the remainder of the chapter.

If the minorizing condition holds for $(X_n)$, there are two ways of deriving
a companion Markov chain $(\check{X}_n)$ sharing many properties with $(X_n)$ and
possessing an atom $\alpha$. The first method is called Nummelin’s splitting and
constructs a chain made of two copies of $(X_n)$ (see Nummelin 1978 and Meyn
and Tweedie 1993, Section 5.1).

A second method, discovered at approximately the same time, is due to 
Athreya and Ney (1978) and uses a stopping time to create an atom. We prefer 
to focus on this latter method because it is related to notions of renewal time,
which are also useful in the control of Markov chain Monte Carlo algorithms 
(see Section 12.2.3). 

Definition 6.22. A renewal time (or regeneration time) is a stopping rule $\tau$
with the property that $(X_\tau, X_{\tau+1}, \ldots)$ is independent of $(X_{\tau-1}, X_{\tau-2}, \ldots)$.

For instance, in Example 6.12, the returns to zero gain are renewal times. 
The excursions between two returns to zero are independent and identically 
distributed (see Feller 1970, Chapter III). More generally, visits to atoms are 
renewal times, whose features are quite appealing in convergence control for 
Markov chain Monte Carlo algorithms (see Chapter 12). 

If (6.10) holds and if the probability $P_x(\tau_C < \infty)$ of a return to $C$ in a
finite time is identically equal to 1 on $\mathcal{X}$, Athreya and Ney (1978) modify the
transition kernel when $X_n \in C$, by simulating $X_{n+1}$ as

$$X_{n+1} \sim \begin{cases} \nu & \text{with probability } \epsilon , \\[4pt] \dfrac{K(X_n, \cdot) - \epsilon \nu(\cdot)}{1 - \epsilon} & \text{with probability } 1 - \epsilon ; \end{cases} \tag{6.12}$$

that is, by simulating $X_{n+1}$ from $\nu$ with probability $\epsilon$ every time $X_n$ is in $C$.
This modification does not change the marginal distribution of $X_{n+1}$ conditionally
on $x_n$, since

$$\epsilon \nu(A) + (1 - \epsilon)\, \frac{K(x_n, A) - \epsilon \nu(A)}{1 - \epsilon} = K(x_n, A) , \qquad \forall A \in \mathcal{B}(\mathcal{X}) ,$$

but it produces renewal times for each time $j$ such that $X_j \in C$ and $X_{j+1} \sim \nu$.
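
The mixture representation behind this construction can be made concrete on a finite chain. In the Python sketch below, the three-state matrix, the set $C = \{0, 1\}$, and the choice of $\nu$ as the normalized column-wise minimum of the rows of $K$ indexed by $C$ are assumptions made for the example, as are the names `residual` and `split_step`.

```python
import random

K = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.1, 0.1, 0.8]]
C = [0, 1]                                   # candidate small set

mins = [min(K[x][y] for x in C) for y in range(3)]
eps = sum(mins)                              # mass of the common component (about 0.7)
nu = [v / eps for v in mins]                 # minorizing probability measure on the states

def residual(x):
    """Residual kernel (K(x, .) - eps * nu) / (1 - eps), using eps * nu[y] = mins[y]."""
    return [(K[x][y] - mins[y]) / (1.0 - eps) for y in range(3)]

def split_step(x, rng):
    """One transition from a state x in C; returns (next state, renewal indicator)."""
    if rng.random() < eps:
        return rng.choices(range(3), weights=nu)[0], True      # drawn from nu: renewal
    return rng.choices(range(3), weights=residual(x))[0], False

rng = random.Random(0)
print([split_step(0, rng) for _ in range(5)])
```

Whenever the renewal indicator is `True`, the next state has been drawn from `nu` without looking at the current state, which is what makes the corresponding times renewal times.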

Now we clearly see how the renewal times result in independent chains.
When $X_{j+1} \sim \nu$, this event is totally independent of any past history, as
the current state of the chain has no effect on the measure $\nu$. Note also the
key role that is played by the minorization condition. It allows us to create
the split chain with the same marginal distribution as the original chain. We
denote by ($j > 0$)

$$\tau_j = \inf\{n > \tau_{j-1}; X_n \in C \text{ and } X_{n+1} \sim \nu\}$$

the sequence of renewal times with $\tau_0 = 0$. Athreya and Ney (1978) introduce
the augmented chain, also called the split chain $\check{X}_n = (X_n, \check{\omega}_n)$, with $\check{\omega}_n = 1$
when $X_n \in C$ and $X_{n+1}$ is generated from $\nu$. It is then easy to show that the
set $\check{\alpha} = C \times \{1\}$ is an atom of the chain $(\check{X}_n)$, the resulting subchain $(X_n)$
being still a Markov chain with transition kernel $K(x_n, \cdot)$ (see Problem 6.17).

In finite and discrete settings, the notion of small set is useful only when
the individual probabilities of states are too small to allow for a reasonable
rate of renewal. In these cases, small sets are made of collections of states with
$\nu$ defined as a minimum. Otherwise, small sets reduced to a single value are
also atoms.

6.3.3 Cycles and Aperiodicity 

The behavior of $(X_n)$ may sometimes be restricted by deterministic constraints
on the moves from $X_n$ to $X_{n+1}$. We formalize these constraints here
and show in the following chapters that the chains produced by Markov chain
Monte Carlo algorithms do not display this behavior and, hence, do not suffer
from the associated drawbacks.

In the discrete case, the period of a state $\omega \in \mathcal{X}$ is defined as

$$d(\omega) = \text{g.c.d.}\, \{m \ge 1; K^m(\omega, \omega) > 0\} ,$$

where we recall that g.c.d. is the greatest common divisor. The value of
the period is constant on all states that communicate with $\omega$. In the case of
an irreducible chain on a finite space $\mathcal{X}$, the transition matrix can be written
(with a possible reordering of the states) as a block matrix

$$P = \begin{pmatrix} 0 & D_1 & 0 & \cdots & 0 \\ 0 & 0 & D_2 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ D_d & 0 & 0 & \cdots & 0 \end{pmatrix} , \tag{6.13}$$
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where the blocks Di are stochastic matrices. This representation clearly illus- 
trates the forced passage from one group of states to another, with a return 
to the initial group occurring every dth step. If the chain is irreducible (so all 
states communicate), there is only one value for the period. An irreducible 
chain is aperiodic if it has period 1 . The extension to the general case requires 
the existence of a small set. 

Definition 6.23. A $\psi$-irreducible chain $(X_n)$ has a cycle of length $d$ if there exists a small set $C$, an associated integer $M$, and a probability distribution $\nu_M$ such that $d$ is the g.c.d. of

$$
\{m \ge 1;\ \exists\, \delta_m > 0 \text{ such that } C \text{ is small for } \nu_m \ge \delta_m \nu_M\}.
$$

A decomposition like (6.13) can be established in general. It is easily shown that the number $d$ is independent of the small set $C$ and that this number intrinsically characterizes the chain $(X_n)$. The period of $(X_n)$ is then defined as the largest integer $d$ satisfying Definition 6.23 and $(X_n)$ is aperiodic if $d = 1$. If there exists a small set $A$ and a minorizing measure $\nu_1$ such that $\nu_1(A) > 0$ (so it is possible to go from $A$ to $A$ in a single step), the chain is said to be strongly aperiodic. Note that the $K_\varepsilon$-chain can be used to transform an aperiodic chain into a strongly aperiodic chain.

In discrete setups, if one state $x \in \mathcal{X}$ satisfies $P_{xx} > 0$, the chain $(X_n)$ is aperiodic, although this is not a necessary condition (see Problem 6.35).

Example 6.24. (Continuation of Example 6.14) The Bernoulli-Laplace chain is aperiodic and even strongly aperiodic since the diagonal terms satisfy $P_{xx} > 0$ for every $x \in \{0, \ldots, K\}$. ‖
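The discrete definition of the period can be checked numerically on a small transition matrix. The following is only a sketch: the two 3-state matrices are hypothetical illustrations (not chains from the text), and the g.c.d. is taken over return times up to a finite horizon `max_steps`, which is an assumption that suffices for small chains but is not a general algorithm.

```python
from math import gcd

def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def period(P, state, max_steps=50):
    """g.c.d. of the return times m <= max_steps with P^m(state, state) > 0."""
    d, power = 0, P
    for m in range(1, max_steps + 1):
        if power[state][state] > 0:
            d = gcd(d, m)
        power = matmul(power, P)
    return d

# Deterministic cycle 0 -> 1 -> 2 -> 0: returns only at multiples of 3.
cycle = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
# Same cycle with a self-loop at state 0: P_00 > 0 makes the chain aperiodic.
lazy = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

print(period(cycle, 0), period(lazy, 0))
```

The second matrix illustrates the sufficient condition stated above: a single positive diagonal term is enough to bring the period down to 1.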

When the chain is continuous and the transition kernel has a component which is absolutely continuous with respect to the Lebesgue measure, with density $f(\cdot|x_n)$, a sufficient condition for aperiodicity is that $f(\cdot|x_n)$ is positive in a neighborhood of $x_n$. The chain can then remain in this neighborhood for an arbitrary number of instants before visiting any set $A$. For instance, in
Example 6.3, (X^) is strongly aperiodic when Sn is distributed according to 
and 1^1 < 1 (in order to guarantee irreducibility). The next chapters 
will demonstrate that Markov chain Monte Carlo algorithms lead to aperiodic 
chains, possibly via the introduction of additional steps. 



6.4 Transience and Recurrence 

6.4.1 Classification of Irreducible Chains 

From an algorithmic point of view, a Markov chain must enjoy good stability properties to guarantee an acceptable approximation of the simulated model. Indeed, irreducibility ensures that every set $A$ will be visited by the Markov chain $(X_n)$, but this property is too weak to ensure that the trajectory of $(X_n)$ will enter $A$ often enough. Consider, for instance, a maximization problem using a random walk on the surface of the function to maximize (see Chapter 5). The convergence to the global maximum cannot be guaranteed without a systematic sweep of this surface. Formalizing this stability of the Markov chain leads to different notions of recurrence. In a discrete setup, the recurrence of a state is equivalent to a guarantee of a sure return. This notion is thus necessarily satisfied for irreducible chains on a finite space.

Definition 6.25. In a finite state-space $\mathcal{X}$, a state $\omega \in \mathcal{X}$ is transient if the average number of visits to $\omega$, $E_\omega[\eta_\omega]$, is finite, and recurrent if $E_\omega[\eta_\omega] = \infty$.

For irreducible chains, the properties of recurrence and transience are properties of the chain, not of a particular state. This fact is easily deduced from the Chapman-Kolmogorov equations. Therefore, if $\eta_A$ denotes the number of visits defined in (6.8), for every $(x,y) \in \mathcal{X}^2$ either $E_x[\eta_y] < \infty$ in the transient case or $E_x[\eta_y] = \infty$ in the recurrent case. The chain is then said to be transient or recurrent, one of the two properties being necessarily satisfied in the irreducible case.

Example 6.26. Branching process. Consider a population whose individuals reproduce independently of one another. Each individual has $X$ sibling(s), $X \in \mathbb{N}$, distributed according to the distribution with generating function $\phi(s) = E[s^X]$. If individuals reproduce at fixed instants (thus defining generations), the size of the $t$th generation $S_t$ ($t > 1$) is given by

$$
S_t = X_1 + \cdots + X_{S_{t-1}},
$$

where the $X_i \sim \phi$ are independent. Starting with a single individual at time 0, $S_1 = X_1$, the generating function of $S_t$ is $g_t(s) = \phi^t(s)$, with $\phi^t = \phi \circ \phi^{t-1}$ ($t > 1$). The chain $(S_t)$ is an example of a branching process (see Feller 1971, Chapter XII).

If $\phi$ does not have a constant term (i.e., if $P(X_1 = 0) = 0$), the chain $(S_t)$ is necessarily transient since it is increasing. If $P(X_1 = 0) > 0$, the probability of a return to 0 at time $t$ is $\rho_t = P(S_t = 0) = g_t(0)$, which thus satisfies the recurrence equation $\rho_t = \phi(\rho_{t-1})$. Therefore, there exists a limit $\rho$ different from 1, such that $\rho = \phi(\rho)$, if and only if $\phi'(1) > 1$; namely if $E[X] > 1$. The chain is thus transient when the average number of siblings per individual is larger than 1. If there exists a restarting mechanism in 0, $S_{t+1} \mid S_t = 0 \sim \phi$, it is easily shown that when $\phi'(1) > 1$, the number of returns to 0 follows a geometric distribution with parameter $\rho$. If $\phi'(1) \le 1$, one can show that the chain is recurrent (see Example 6.42). ‖
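The recurrence $\rho_t = \phi(\rho_{t-1})$ can be iterated numerically to see the dichotomy between $\phi'(1) > 1$ and $\phi'(1) \le 1$. The sketch below assumes a Poisson offspring distribution with mean `lam`, so that $\phi(s) = \exp\{\lambda(s-1)\}$; this choice and the function names are illustrative assumptions, not part of the example above.

```python
from math import exp

def poisson_pgf(lam):
    """Generating function s -> E[s^X] for X ~ Poisson(lam) (assumed offspring law)."""
    return lambda s: exp(lam * (s - 1.0))

def extinction_probability(phi, n_iter=1000):
    """Iterate rho_t = phi(rho_{t-1}) from rho_0 = P(S_0 = 0) = 0."""
    rho = 0.0
    for _ in range(n_iter):
        rho = phi(rho)
    return rho

rho_super = extinction_probability(poisson_pgf(1.5))  # phi'(1) = 1.5 > 1
rho_sub = extinction_probability(poisson_pgf(0.8))    # phi'(1) = 0.8 <= 1

print(rho_super, rho_sub)
```

With mean 1.5 the iteration settles on a fixed point strictly below 1, while with mean 0.8 it converges to 1, in line with the transient and recurrent cases described above.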

The treatment of the general (that is to say, non-discrete) case is based on chains with atoms, the extension to general chains (with small sets) following from Athreya and Ney's (1978) splitting. We begin by extending the notions of recurrence and transience.

Definition 6.27. A set $A$ is called recurrent if $E_x[\eta_A] = +\infty$ for every $x \in A$. The set $A$ is uniformly transient if there exists a constant $M$ such that $E_x[\eta_A] < M$ for every $x \in A$. It is transient if there exists a covering of $\mathcal{X}$ by uniformly transient sets; that is, a countable collection of uniformly transient sets $B_i$ such that

$$
A = \bigcup_i B_i .
$$

Theorem 6.28. Let $(X_n)$ be a $\psi$-irreducible Markov chain with an accessible atom $\alpha$.

(i) If $\alpha$ is recurrent, every set $A$ of $\mathcal{B}(\mathcal{X})$ such that $\psi(A) > 0$ is recurrent.
(ii) If $\alpha$ is transient, $\mathcal{X}$ is transient.

Property (i) is the most relevant in the Markov chain Monte Carlo setup and can be derived from the Chapman-Kolmogorov equations. Property (ii) is more difficult to establish and uses the fact that $P_\alpha(\tau_\alpha < \infty) < 1$ for a transient set when $E_x[\eta_A]$ is decomposed conditionally on the last visit to $\alpha$ (see Meyn and Tweedie 1993, p. 181, and Problem 6.29).

Definition 6.29. A Markov chain $(X_n)$ is recurrent if

(i) there exists a measure $\psi$ such that $(X_n)$ is $\psi$-irreducible, and

(ii) for every $A \in \mathcal{B}(\mathcal{X})$ such that $\psi(A) > 0$, $E_x[\eta_A] = \infty$ for every $x \in A$.

The chain is transient if it is $\psi$-irreducible and if $\mathcal{X}$ is transient.

The classification result of Theorem 6.28 can be easily extended to strongly aperiodic chains since they satisfy a minorizing condition (6.11), thus can be split as in Section 6.3.2, while the chain $(X_n)$ and its split version $(\check{X}_n)$ (see Problem 6.17) are either both recurrent or both transient. The generalization to an arbitrary irreducible chain follows from the properties of the corresponding $K_\varepsilon$-chain which is strongly aperiodic, through the relation

OD ^ OO 

(6.14) ^ 

n=0 n=0 

since 

oo oo 

EM = = 

n=0 n=0 

This provides us with the following classification result: 

Theorem 6.30. A $\psi$-irreducible chain is either recurrent or transient.

6.4.2 Criteria for Recurrence



The previous results establish a clear dichotomy between transience and re- 
currence for irreducible Markov chains. Nevertheless, given the requirement 
of Definition 6.29, it is useful to examine simpler criteria for recurrence. By 
analogy with discrete state-space Markov chains, a first approach is based on 
small sets. 



Proposition 6.31. A $\psi$-irreducible chain $(X_n)$ is recurrent if there exists a small set $C$ with $\psi(C) > 0$ such that $P_x(\tau_C < \infty) = 1$ for every $x \in C$.

Proof. First, we show that the set $C$ is recurrent. Given $x \in C$, consider $u_n = K^n(x,C)$ and $f_n = P_x(X_n \in C, X_{n-1} \notin C, \ldots, X_1 \notin C)$, which is the probability of first visit to $C$ at the $n$th instant, and define

$$
U(s) = 1 + \sum_{n=1}^{\infty} u_n s^n \qquad\text{and}\qquad Q(s) = \sum_{n=1}^{\infty} f_n s^n .
$$

The equation

$$
u_n = f_n + f_{n-1} u_1 + \cdots + f_1 u_{n-1}
\tag{6.15}
$$

describes the relation between the probability of a visit of $C$ at time $n$ and the probabilities of first visit of $C$. This implies

$$
U(s) = \frac{1}{1 - Q(s)},
$$

which connects $U(1) = E_x[\eta_C] = \infty$ with $Q(1) = P_x(\tau_C < \infty) = 1$. Equation (6.15) is, in fact, valid for the split chain $(\check{X}_n)$ (see Problem 6.17), since a visit to $C \times \{0\}$ ensures independence by renewal. Since $E_x[\eta_C]$, associated with $(X_n)$, is larger than $E_x[\eta_{C \times \{0\}}]$, associated with $(\check{X}_n)$, and $P_x(\tau_C < \infty)$ for $(X_n)$ is equal to $P_x(\tau_{C \times \{0\}} < \infty)$ for $(\check{X}_n)$, the recurrence can be extended from $(\check{X}_n)$ to $(X_n)$. The recurrence of $(\check{X}_n)$ follows from Theorem 6.28, since $C \times \{0\}$ is a recurrent atom for $(\check{X}_n)$. □
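The renewal relation (6.15) can be verified numerically when $C$ is a single state of a finite chain, in which case it holds exactly. The sketch below uses a hypothetical 3-state transition matrix with $C = \{0\}$ (an assumption made for illustration): `u[n]` is $K^n(0,0)$ and `f[n]` is the probability of a first return to 0 at time $n$.

```python
# Hypothetical 3-state transition matrix; C = {0} plays the role of the atom.
P = [[0.2, 0.5, 0.3],
     [0.4, 0.1, 0.5],
     [0.6, 0.3, 0.1]]
N = 30
states = range(3)

# u[n] = P^n(0, 0), with the convention u[0] = 1.
u = [1.0]
col = [1.0, 0.0, 0.0]                      # column x -> P^0(x, 0)
for n in range(1, N + 1):
    col = [sum(P[x][y] * col[y] for y in states) for x in states]
    u.append(col[0])

# f[n] = P_0(first return to 0 at time n); h[x] is the same quantity from x,
# propagated backwards while forbidding intermediate visits to state 0.
f = [0.0]
h = [P[x][0] for x in states]
for n in range(1, N + 1):
    f.append(h[0])
    h = [sum(P[x][y] * h[y] for y in (1, 2)) for x in states]

# Residual of (6.15): u_n - (f_n u_0 + f_{n-1} u_1 + ... + f_1 u_{n-1}).
residual = max(abs(u[n] - sum(f[k] * u[n - k] for k in range(1, n + 1)))
               for n in range(1, N + 1))
print(residual, sum(f))
```

The sum of the `f[n]` approximates $Q(1) = P_0(\tau_0 < \infty)$, which is 1 for this irreducible finite chain, while the partial sums of `u[n]` grow without bound, as the identity $U(s) = 1/(1 - Q(s))$ requires.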

A second method of checking recurrence is based on a generalization of the notions of small sets and minorizing conditions. This generalization involves a potential function $V$ and a drift condition like (6.38) and uses the transition kernel $K(\cdot,\cdot)$ rather than the sequence $K^n$. Note 6.9.1 details this approach, as well as its bearing on the following stability and convergence results.



6.4.3 Harris Recurrence 

It is actually possible to strengthen the stability properties of a chain $(X_n)$ by requiring not only an infinite average number of visits to every small set but also an infinite number of visits for every path of the Markov chain. Recall that $\eta_A$ is the number of passages of $(X_n)$ in $A$, and we consider $P_x(\eta_A = \infty)$, the probability of visiting $A$ an infinite number of times. The following notion of recurrence was introduced by Harris (1956).

Definition 6.32. A set $A$ is Harris recurrent if $P_x(\eta_A = \infty) = 1$ for all $x \in A$. The chain $(X_n)$ is Harris recurrent if there exists a measure $\psi$ such that $(X_n)$ is $\psi$-irreducible and for every set $A$ with $\psi(A) > 0$, $A$ is Harris recurrent.

Recall that recurrence corresponds to $E_x[\eta_A] = \infty$, a weaker condition than $P_x(\eta_A = \infty) = 1$ (see Problem 6.30). The following proposition expresses Harris recurrence as a condition on $P_x(\tau_A < \infty)$ defined in (6.8).

Proposition 6.33. If for every $A \in \mathcal{B}(\mathcal{X})$, $P_x(\tau_A < \infty) = 1$ for every $x \in A$, then $P_x(\eta_A = \infty) = 1$, for all $x \in \mathcal{X}$, and $(X_n)$ is Harris recurrent.

Proof. The average number of visits to $B$ before a first visit to $A$ is

$$
U_A(x,B) = \sum_{n=1}^{\infty} P_x(X_n \in B, \tau_A \ge n).
\tag{6.16}
$$

Then, $U_A(x,A) = P_x(\tau_A < \infty)$, since, if $B \subset A$, $P_x(X_n \in B, \tau_A \ge n) = P_x(X_n \in B, \tau_A = n) = P_x(\tau_B = n)$. Similarly, if $\tau_A(k)$, $k \ge 1$, denotes the time of the $k$th visit to $A$, $\tau_A(k)$ satisfies

$$
P_x(\tau_A(2) < \infty) = \int_A P_y(\tau_A < \infty)\, U_A(x,dy) = 1
$$

for every $x \in A$ and, by induction,

$$
P_x(\tau_A(k+1) < \infty) = \int_A P_y(\tau_A(k) < \infty)\, U_A(x,dy) = 1 .
$$

Since $P_x(\eta_A \ge k) = P_x(\tau_A(k) < \infty)$ and

$$
P_x(\eta_A = \infty) = \lim_{k \to \infty} P_x(\eta_A \ge k),
$$

we deduce that $P_x(\eta_A = \infty) = 1$ for $x \in A$. □

Note that the property of Harris recurrence is needed only when $\mathcal{X}$ is not denumerable. If $\mathcal{X}$ is finite or denumerable, we can indeed show that $E_x[\eta_x] = \infty$ if and only if $P_x(\tau_x < \infty) = 1$ for every $x \in \mathcal{X}$, through an argument similar to the proof of Proposition 6.31. In the general case, it is possible to prove that if $(X_n)$ is Harris recurrent, then $P_x(\eta_B = \infty) = 1$ for every $x \in \mathcal{X}$ and $B \in \mathcal{B}(\mathcal{X})$ such that $\psi(B) > 0$. This property then provides a sufficient condition for Harris recurrence which generalizes Proposition 6.31.

Theorem 6.34. If $(X_n)$ is a $\psi$-irreducible Markov chain with a small set $C$ such that $P_x(\tau_C < \infty) = 1$ for all $x \in \mathcal{X}$, then $(X_n)$ is Harris recurrent.

Contrast this theorem with Proposition 6.31, where $P_x(\tau_C < \infty) = 1$ only for $x \in C$. This theorem also allows us to replace recurrence by Harris recurrence in Theorem 6.72. (See Meyn and Tweedie 1993, pp. 204-205 for a discussion of the "almost" Harris recurrence of recurrent chains.) Tierney (1994) and Chan and Geyer (1994) analyze the role of Harris recurrence in the setup of Markov chain Monte Carlo algorithms and note that Harris recurrence holds for most of these algorithms (see Chapters 7 and 10).^



6.5 Invariant Measures 

6.5.1 Stationary Chains 

An increased level of stability for the chain $(X_n)$ is attained if the marginal distribution of $X_n$ is independent of $n$. More formally, this is a requirement for the existence of a probability distribution $\pi$ such that $X_{n+1} \sim \pi$ if $X_n \sim \pi$, and Markov chain Monte Carlo methods are based on the fact that this requirement, which defines a particular kind of recurrence called positive recurrence, can be met. The Markov chains constructed from Markov chain Monte Carlo algorithms enjoy this greater stability property (except in very pathological cases; see Section 10.4.3). We therefore provide an abridged description of invariant measures and positive recurrence.

Definition 6.35. A $\sigma$-finite measure $\pi$ is invariant for the transition kernel $K(\cdot,\cdot)$ (and for the associated chain) if

$$
\pi(B) = \int_{\mathcal{X}} K(x,B)\, \pi(dx), \qquad \forall B \in \mathcal{B}(\mathcal{X}) .
$$

When there exists an invariant probability measure for a $\psi$-irreducible (hence recurrent by Theorem 6.30) chain, the chain is positive. Recurrent chains that do not allow for a finite invariant measure are called null recurrent.

The invariant distribution is also referred to as stationary if $\pi$ is a probability measure, since $X_0 \sim \pi$ implies that $X_n \sim \pi$ for every $n$; thus, the chain is stationary in distribution. (Note that the alternative case when $\pi$ is not finite is more difficult to interpret in terms of behavior of the chain.) It is easy to show that if the chain is irreducible and allows for a $\sigma$-finite invariant measure, this measure is unique, up to a multiplicative factor (see Problem 6.60). The link between positivity and recurrence is given by the following result, which formalizes the intuition that the existence of an invariant measure prevents the probability mass from "escaping to infinity."

Proposition 6.36. If the chain $(X_n)$ is positive, it is recurrent.

Proof. If $(X_n)$ is transient, there exists a covering of $\mathcal{X}$ by uniformly transient sets, $A_j$, with corresponding bounds

$$
E_x[\eta_{A_j}] \le M_j, \qquad \forall x \in A_j,\ \forall j \in \mathbb{N} .
$$

Therefore, by the invariance of $\pi$,

$$
\pi(A_j) = \int K(x,A_j)\, \pi(dx) = \int K^n(x,A_j)\, \pi(dx) .
$$

Therefore, for every $k \in \mathbb{N}$,

$$
k\, \pi(A_j) = \sum_{n=1}^{k} \int K^n(x,A_j)\, \pi(dx) \le \int E_x[\eta_{A_j}]\, \pi(dx) \le M_j ,
$$

since, from (6.8) it follows that $\sum_{n=1}^{k} K^n(x,A_j) \le E_x[\eta_{A_j}]$. Letting $k$ go to $\infty$ shows that $\pi(A_j) = 0$, for every $j \in \mathbb{N}$, and hence the impossibility of obtaining an invariant probability measure. □

We may, therefore, talk of positive chains and of Harris positive chains, without the superfluous denomination recurrent and Harris recurrent. Proposition 6.36 is useful only when the positivity of $(X_n)$ can be proved, but, again, the chains produced by Markov chain Monte Carlo methods are, by nature, guaranteed to possess an invariant distribution.

^ Chan and Geyer (1994) particularly stress that "Harris recurrence essentially says that there is no measure-theoretic pathology (...) The main point about Harris recurrence is that asymptotics do not depend on the starting distribution because of the 'split' chain construction."

6.5.2 Kac’s Theorem 

A classical result (see Feller 1970) on irreducible Markov chains with discrete state-space is that the stationary distribution, when it exists, is given by

$$
\pi_x = (E_x[\tau_x])^{-1}, \qquad x \in \mathcal{X},
$$

where, from (6.7), we can interpret $E_x[\tau_x]$ as the average number of excursions between two passages in $x$. (It is sometimes called Kac's Theorem.) It also follows that $(E_x[\tau_x]^{-1})$ is the eigenvector associated with the eigenvalue 1 for the transition matrix $P$ (see Problems 6.10 and 6.61). We now establish this result in the more general case when $(X_n)$ has an atom, $\alpha$.

Theorem 6.37. Let $(X_n)$ be $\psi$-irreducible with an atom $\alpha$. The chain is positive if and only if $E_\alpha[\tau_\alpha] < \infty$. In this case, the invariant distribution $\pi$ for $(X_n)$ satisfies

$$
\pi(\alpha) = (E_\alpha[\tau_\alpha])^{-1} .
$$

The notation $E_\alpha[\,\cdot\,]$ is legitimate in this case since the transition kernel is the same for every $x \in \alpha$ (see Definition 6.18). Moreover, Theorem 6.37 indicates how positivity is a stability property stronger than recurrence. In fact, the latter corresponds to

$$
P_\alpha(\tau_\alpha = \infty) = 0,
$$

which is a necessary condition for $E_\alpha[\tau_\alpha] < \infty$.

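Proof. If $E_\alpha[\tau_\alpha] < \infty$, $P_\alpha(\tau_\alpha < \infty) = 1$; thus, $(X_n)$ is recurrent from Proposition 6.31. Consider a measure $\pi$ given by

$$
\pi(A) = \sum_{n=1}^{\infty} P_\alpha(X_n \in A, \tau_\alpha \ge n)
\tag{6.17}
$$

as in (6.16). This measure is invariant since $\pi(\alpha) = P_\alpha(\tau_\alpha < \infty) = 1$ and

$$
\begin{aligned}
\int K(x,A)\, \pi(dx) &= \pi(\alpha) K(\alpha,A) + \int_{\alpha^c} \sum_{n=1}^{\infty} K(x_n,A)\, P_\alpha(\tau_\alpha > n, dx_n) \\
&= K(\alpha,A) + \sum_{n=2}^{\infty} P_\alpha(X_n \in A, \tau_\alpha \ge n) = \pi(A) .
\end{aligned}
$$

It is also finite as

$$
\begin{aligned}
\pi(\mathcal{X}) &= \sum_{n=1}^{\infty} P_\alpha(\tau_\alpha \ge n) = \sum_{n=1}^{\infty} \sum_{m=n}^{\infty} P_\alpha(\tau_\alpha = m) \\
&= \sum_{m=1}^{\infty} m\, P_\alpha(\tau_\alpha = m) = E_\alpha[\tau_\alpha] < \infty .
\end{aligned}
$$

Since $\pi$ is invariant when $(X_n)$ is positive, the uniqueness of the invariant distribution implies finiteness of $\pi(\mathcal{X})$, thus of $E_\alpha[\tau_\alpha]$. Renormalizing $\pi$ to $\pi/\pi(\mathcal{X})$ implies $\pi(\alpha) = (E_\alpha[\tau_\alpha])^{-1}$. □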
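In the discrete case, Kac's identity $\pi_x = (E_x[\tau_x])^{-1}$ can be checked directly on a small chain. The sketch below assumes a hypothetical 3-state birth-and-death matrix; the stationary distribution is obtained by power iteration and the mean return times by iterating the first-step equations, both chosen here for simplicity rather than efficiency.

```python
# Hypothetical 3-state birth-and-death chain used only for illustration.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
n = len(P)

def stationary(P, n_iter=2000):
    """Left eigenvector for eigenvalue 1, by repeated multiplication pi <- pi P."""
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    return pi

def mean_return_time(P, a, n_iter=2000):
    """E_a[tau_a], using hitting times m[y] of a that solve m = 1 + Q m off a."""
    m = [0.0] * n
    for _ in range(n_iter):
        m = [0.0 if y == a else
             1.0 + sum(P[y][z] * m[z] for z in range(n) if z != a)
             for y in range(n)]
    return 1.0 + sum(P[a][z] * m[z] for z in range(n) if z != a)

pi = stationary(P)
returns = [mean_return_time(P, a) for a in range(n)]
print(pi, returns)
```

For this matrix the products `pi[a] * returns[a]` are all equal to 1 up to rounding, which is the discrete form of Theorem 6.37 with each state acting as an atom.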

Following a now "classical" approach, the general case can be treated by splitting $(X_n)$ to $(\check{X}_n)$ (which has an atom) and the invariant measure of $(\check{X}_n)$ induces an invariant measure for $(X_n)$ by marginalization. A converse of Proposition 6.31 establishes the generality of invariance for Markov chains (see Meyn and Tweedie 1993, pp. 240-245, for a proof).

Theorem 6.38. If $(X_n)$ is a recurrent chain, there exists an invariant $\sigma$-finite measure which is unique up to a multiplicative factor.

Example 6.39. Random walk on $\mathbb{R}$. Consider the random walk on $\mathbb{R}$, $X_{n+1} = X_n + W_n$, where $W_n$ has a cdf $\Gamma$. Since $K(x,\cdot)$ is the distribution with cdf $\Gamma(y - x)$, the distribution of $X_{n+1}$ is invariant by translation, and this implies that the Lebesgue measure is an invariant measure:

$$
\int K(x,A)\, \lambda(dx) = \int \int_{A-x} \Gamma(dy)\, \lambda(dx) = \int \Gamma(dy) \int \mathbb{I}_{A-y}(x)\, \lambda(dx) = \lambda(A) .
$$

Moreover, the invariance of $\lambda$ and the uniqueness of the invariant measure imply that the chain $(X_n)$ cannot be positive recurrent (in fact, it can be shown that it is null recurrent). ‖



Example 6.40. Random walk on $\mathbb{Z}$. A random walk on $\mathbb{Z}$ is defined by

$$
X_{n+1} = X_n + W_n ,
$$

the perturbations $W_n$ being iid with distribution $\gamma_k = P(W_n = k)$, $k \in \mathbb{Z}$. With the same kind of argument as in Example 6.39, since the counting measure on $\mathbb{Z}$ is invariant for $(X_n)$, $(X_n)$ cannot be positive. If the distribution of $W_n$ is symmetric, straightforward arguments lead to the conclusion that

$$
\sum_{n=1}^{\infty} P_0(X_n = 0) = \infty ,
$$

from which we derive the (null) recurrence of $(X_n)$ (see Feller 1970, Durrett 1991, or Problem 6.25). ‖
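The divergence of $\sum_n P_0(X_n = 0)$ is easy to observe in the simplest symmetric case. The sketch below assumes the simple walk with steps $+1$ or $-1$ with probability $1/2$ each (one particular symmetric choice of $\gamma_k$, not the general case of the example), for which $P_0(X_{2m} = 0) = \binom{2m}{m} 4^{-m}$ and odd times contribute nothing.

```python
from math import comb

def return_probability(m):
    """P_0(X_{2m} = 0) for the simple symmetric +1/-1 walk (assumed step law)."""
    return comb(2 * m, m) / 4 ** m

def partial_sums(M):
    """Running sums of P_0(X_{2m} = 0), m = 1..M, via the ratio (2m - 1) / (2m)."""
    r, total, out = 1.0, 0.0, []
    for m in range(1, M + 1):
        r *= (2 * m - 1) / (2 * m)   # r is now P_0(X_{2m} = 0)
        total += r
        out.append(total)
    return out

sums = partial_sums(10_000)
print(sums[9], sums[99], sums[999], sums[9999])
```

The partial sums keep growing, roughly like the square root of the number of terms, so the expected number of returns to 0 is infinite even though each individual return probability tends to 0.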



Example 6.41. (Continuation of Example 6.24) Given the quasi-diagonal shape of the transition matrix, it is possible to directly determine the invariant distribution, $\pi = (\pi_0, \ldots, \pi_K)$. In fact, it follows from the equation $P^{\mathsf{T}} \pi = \pi$ that

$$
\begin{aligned}
\pi_0 &= P_{00}\pi_0 + P_{10}\pi_1, \\
\pi_1 &= P_{01}\pi_0 + P_{11}\pi_1 + P_{21}\pi_2, \\
&\ \ \vdots \\
\pi_K &= P_{(K-1)K}\pi_{K-1} + P_{KK}\pi_K .
\end{aligned}
$$

Therefore,

$$
\begin{aligned}
\pi_1 &= \frac{P_{01}}{P_{10}}\, \pi_0, \\
\pi_2 &= \frac{P_{01} P_{12}}{P_{21} P_{10}}\, \pi_0, \\
&\ \ \vdots \\
\pi_k &= \frac{P_{01} \cdots P_{(k-1)k}}{P_{k(k-1)} \cdots P_{10}}\, \pi_0 .
\end{aligned}
$$

Hence,

$$
\pi_k = \binom{K}{k}^2 \pi_0, \qquad k = 0, \ldots, K,
$$

and through normalization,

$$
\pi_k = \frac{\binom{K}{k}^2}{\binom{2K}{K}},
$$

which implies that the hypergeometric distribution $\mathcal{H}(2K, K, 1/2)$ is the invariant distribution for the Bernoulli-Laplace model. Therefore, the chain is positive. ‖
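The invariance of $\pi_k = \binom{K}{k}^2 / \binom{2K}{K}$ can be confirmed in exact arithmetic. The sketch below assumes the usual Bernoulli-Laplace transition probabilities $P_{x,x-1} = (x/K)^2$, $P_{x,x+1} = ((K-x)/K)^2$ and $P_{xx} = 2x(K-x)/K^2$; Example 6.14 is not reproduced here, so this parametrization is an assumption, as is the choice $K = 4$.

```python
from fractions import Fraction
from math import comb

def bernoulli_laplace(K):
    """Tridiagonal transition matrix (assumed standard parametrization)."""
    P = [[Fraction(0)] * (K + 1) for _ in range(K + 1)]
    for x in range(K + 1):
        if x > 0:
            P[x][x - 1] = Fraction(x * x, K * K)
        if x < K:
            P[x][x + 1] = Fraction((K - x) ** 2, K * K)
        P[x][x] = Fraction(2 * x * (K - x), K * K)
    return P

K = 4
P = bernoulli_laplace(K)
pi = [Fraction(comb(K, k) ** 2, comb(2 * K, K)) for k in range(K + 1)]

# pi P computed exactly; invariance means it reproduces pi term by term.
pi_P = [sum(pi[x] * P[x][y] for x in range(K + 1)) for y in range(K + 1)]
print(pi, pi_P == pi)
```

Using `Fraction` avoids any rounding, so the equality test is an exact check of the balance equations listed above for this value of $K$.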



Example 6.42. (Continuation of Example 6.26) Assume $f'(1) \le 1$, where $f$ denotes the generating function of the number of siblings. If there exists an invariant distribution for $(S_t)$, its generating function $g$ satisfies

$$
g(s) = f(s)\, g(0) + g[f(s)] - g(0) .
\tag{6.18}
$$

In the simplest case, that is to say, when the number of siblings of a given individual is distributed according to a Bernoulli distribution $\mathcal{B}(p)$, $f(s) = q + ps$, where $q = 1 - p$, and $g(s)$ is solution of

$$
g(s) = g(q + ps) + p(s-1)\, g(0) .
\tag{6.19}
$$

Iteratively applying (6.19), we obtain

$$
\begin{aligned}
g(s) &= g[q + p(q + ps)] + p(q + ps - 1)\, g(0) + p(s-1)\, g(0) \\
&= g(q + pq + \cdots + p^{k-1}q + p^k s) + (p + \cdots + p^k)(s-1)\, g(0),
\end{aligned}
$$

for every $k \in \mathbb{N}$. Letting $k$ go to infinity, we have

$$
\begin{aligned}
g(s) &= g[q/(1-p)] + [p/(1-p)](s-1)\, g(0) \\
&= 1 + \frac{p}{q}(s-1)\, g(0),
\end{aligned}
$$

since $q/(1-p) = 1$ and $g(1) = 1$. Substituting $s = 0$ implies $g(0) = q$ and, hence, $g(s) = 1 + p(s-1) = q + ps$. The Bernoulli distribution is thus the invariant distribution and the chain is positive. ‖



Example 6.43. (Continuation of Example 6.20) Given that the transition kernel corresponds to the $\mathcal{N}(\theta x_{n-1}, \sigma^2)$ distribution, a normal distribution $\mathcal{N}(\mu, \tau^2)$ is stationary for the AR(1) chain only if

$$
\mu = \theta\mu \qquad\text{and}\qquad \tau^2 = \tau^2\theta^2 + \sigma^2 .
$$

[Figure 6.1: four panels of two-dimensional trajectories; recoverable panel titles $\theta = 0.4$ and $\theta = 0.8$.]

Fig. 6.1. Trajectories of four AR(1) chains with $\sigma = 1$. The first three panels show positive recurrent chains, and as $\theta$ increases the chain gets closer to transience. The fourth chain with $\theta = 1.0001$ is transient, and the trajectory never returns.

These conditions imply that $\mu = 0$ and that $\tau^2 = \sigma^2/(1 - \theta^2)$, which can only occur for $|\theta| < 1$. In this case, $\mathcal{N}(0, \sigma^2/(1 - \theta^2))$ is indeed the unique stationary distribution of the AR(1) chain.

So if $|\theta| < 1$, the marginal distribution of the chain is a proper density independent of $n$, and the chain is positive (hence recurrent). Figure 6.1 shows the two-dimensional trajectories of an AR(1) chain, where each coordinate is a univariate AR(1) chain. (We use two dimensions to better graphically illustrate the behavior of the chain.)

In the first three panels of Figure 6.1 we see increasing $\theta$, but all three chains are positive recurrent. This results in the chain "filling" the space; and we can see that as $\theta$ increases the filling becomes less dense. Finally, the fourth chain, with $\theta = 1.001$, is transient, and not only does it not fill the space, but it escapes and never returns. Note the scale on that panel.

When we use a Markov chain to explore a space, we want it to fill the space. Thus, we want our MCMC chains to be positive recurrent. ‖
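The stationarity condition of Example 6.43 can be illustrated in two ways: by iterating the variance recursion $\tau^2 \mapsto \theta^2\tau^2 + \sigma^2$, and by simulating the chain. The sketch below assumes $\theta = 0.8$, $\sigma = 1$, a zero starting value and a fixed seed; these values are illustrative choices, not those behind Figure 6.1.

```python
import random

def variance_recursion(theta, sigma, n_iter=500, tau2=0.0):
    """Iterate tau2 <- theta^2 * tau2 + sigma^2, the marginal variance of X_n."""
    for _ in range(n_iter):
        tau2 = theta ** 2 * tau2 + sigma ** 2
    return tau2

def simulate_ar1(theta, sigma, n, seed=0):
    """Path of X_t = theta * X_{t-1} + eps_t with eps_t ~ N(0, sigma^2), X_0 = 0."""
    rng = random.Random(seed)
    x, path = 0.0, []
    for _ in range(n):
        x = theta * x + rng.gauss(0.0, sigma)
        path.append(x)
    return path

theta, sigma = 0.8, 1.0
target = sigma ** 2 / (1 - theta ** 2)          # stationary variance
limit = variance_recursion(theta, sigma)
path = simulate_ar1(theta, sigma, 100_000)
empirical = sum(x * x for x in path) / len(path)
runaway = variance_recursion(1.0001, sigma)     # theta > 1: no finite limit

print(target, limit, empirical, runaway)
```

For $|\theta| < 1$ the recursion and the simulated second moment both settle near $\sigma^2/(1-\theta^2)$, whereas for $\theta$ slightly above 1 the variance keeps increasing with the number of iterations, which matches the escaping trajectory described above.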

Note that the converse to Theorem 6.38 does not hold: there exist transient Markov chains with stationary measures. For instance, the random walks in $\mathbb{R}^3$ and $\mathbb{Z}^3$, corresponding to Examples 6.39 and 6.40, respectively, are both transient and have the Lebesgue and the counting measures as stationary measures (see Problem 6.25).

In the case of a general Harris recurrent irreducible and aperiodic Markov chain $(X_n)$ with stationary distribution $\pi$, Hobert and Robert (2004) propose a representation of $\pi$ that is associated with Kac's Theorem, under the form

$$
\pi(A) = \sum_{t=1}^{\infty} \Pr(N_t \in A)\, \Pr(T^* = t),
\tag{6.20}
$$

where $T^*$ is an integer valued random variable and, for each $t \in \mathbb{N}$, $N_t$ is a random variable whose distribution depends on $t$. This representation is very closely related to Kac's (6.17), but offers a wider range of applicability since it does not require the chain to have an atom. We simply assume that the minorizing condition (6.10) is satisfied (and for simplicity's sake we take $m = 1$ in (6.10)). The random variables $N_t$ and $T^*$ can then be defined in terms of the split chain of Section 6.3.2. If $\tau_C \ge 1$ denotes the renewal time associated with the small set $C$ in (6.10), then $E_\nu[\tau_C] < \infty$ by recurrence (Problem 6.32), and $T^*$ is given by the tail probabilities of $\tau_C$ as

$$
P(T^* = t) = \frac{P_\nu(\tau_C \ge t)}{E_\nu[\tau_C]} .
\tag{6.21}
$$

The random variable $N_t$ is then logically distributed from $\nu$ if $t = 1$ and as $X_t$ conditional on no renewal before time $t$ otherwise, following from (6.10). Breyer and Roberts (2000b) derive the representation (6.20) by means of a functional equation (see Problem 6.33).

Simulating from $\pi$ thus amounts to simulating $T^*$ according to (6.21) and then, for $T^* = t$, to simulating $N_t$. Simulating the latter starts from the minorizing measure $\nu$ and then runs $t - 1$ steps of the residual distribution

$$
\tilde{K}(x,\cdot) = \frac{K(x,\cdot) - \epsilon\, \mathbb{I}_C(x)\, \nu(\cdot)}{1 - \epsilon\, \mathbb{I}_C(x)} .
$$

In cases when simulating from the residual is too complex, a brute force Accept-Reject implementation is to run the split chain $t$ iterations until $\tau_C \ge t$, but this may be too time-consuming in many situations. Hobert and Robert (2004) also propose more advanced approaches to the simulation of $T^*$.

Note that, when the state space $\mathcal{X}$ is small, the chain is said to be uniformly ergodic (see Definition 6.58 below), $\tilde{K}(x,y) = (K(x,y) - \epsilon\nu(y))/(1 - \epsilon)$ and the mixture representation (6.20) translates into the following algorithm.

Algorithm A.23 -Kac's Mixture Implementation-

1. Simulate $X_0 \sim \nu$, $\omega \sim \mathcal{G}eo(\epsilon)$.

2. Run the transition $X_{t+1} \sim \tilde{K}(x_t, y)$, $t = 0, \ldots, \omega - 1$, and take $X_\omega$.
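On a finite state space where the whole space is small, the mixture representation can be checked exactly and the two steps of the algorithm can be coded directly. The sketch below assumes a hypothetical 3-state kernel with strictly positive entries, takes $\epsilon\nu(y) = \min_x K(x,y)$ as the minorization, and uses a geometric variable on $\{0, 1, \ldots\}$ with $P(\omega = k) = \epsilon(1-\epsilon)^k$; all names are illustrative.

```python
import random

# Hypothetical kernel with positive entries, so that the whole space is small.
K = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
n = len(K)

col_min = [min(K[x][y] for x in range(n)) for y in range(n)]   # eps * nu(y)
eps = sum(col_min)
nu = [c / eps for c in col_min]
K_res = [[(K[x][y] - col_min[y]) / (1 - eps) for y in range(n)] for x in range(n)]

def step(dist, M):
    """One step of the row vector dist through the kernel M."""
    return [sum(dist[x] * M[x][y] for x in range(n)) for y in range(n)]

# Mixture sum_k eps (1 - eps)^k nu K_res^k, truncated after 200 terms.
mixture, dist, weight = [0.0] * n, nu[:], eps
for _ in range(200):
    mixture = [m + weight * d for m, d in zip(mixture, dist)]
    dist = step(dist, K_res)
    weight *= 1 - eps

# Reference value: stationary distribution of K by power iteration.
pi = [1.0 / n] * n
for _ in range(500):
    pi = step(pi, K)

def kac_mixture_draw(rng):
    """Steps 1 and 2 above: start from nu, run omega ~ Geo(eps) residual moves."""
    x = rng.choices(range(n), weights=nu)[0]
    while rng.random() >= eps:          # continue with probability 1 - eps
        x = rng.choices(range(n), weights=K_res[x])[0]
    return x

print(mixture, pi, kac_mixture_draw(random.Random(1)))
```

The truncated mixture agrees with the power-iteration value of $\pi$ up to rounding, which is the content of (6.20) in this uniformly ergodic setting; the sampler returns one exact draw per call without any burn-in.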





6.5.3 Reversibility and the Detailed Balance Condition 

The stability property inherent to stationary chains can be related to another stability property called reversibility, which states that the direction of time does not matter in the dynamics of the chain (see also Problems 6.53 and 6.54).

Definition 6.44. A stationary Markov chain $(X_n)$ is reversible if the distribution of $X_{n+1}$ conditionally on $X_{n+2} = x$ is the same as the distribution of $X_{n+1}$ conditionally on $X_n = x$.

In fact, reversibility can be linked with the existence of a stationary measure $\pi$ if a condition stronger than in Definition 6.35 holds.

Definition 6.45. A Markov chain with transition kernel $K$ satisfies the detailed balance condition if there exists a function $f$ satisfying

$$
K(y,x)\, f(y) = K(x,y)\, f(x)
\tag{6.22}
$$

for every $(x,y)$.

While this condition is not necessary for $f$ to be a stationary measure associated with the transition kernel $K$, it provides a sufficient condition that is often easy to check and that can be used for most MCMC algorithms. The balance condition (6.22) expresses an equilibrium in the flow of the Markov chain, namely that the probability of being in $x$ and moving to $y$ is the same as the probability of being in $y$ and moving back to $x$. When $f$ is a density, it also implies that the chain is reversible.^ More generally,



Theorem 6.46. Suppose that a Markov chain with transition function $K$ satisfies the detailed balance condition with $\pi$ a probability density function. Then:

(i) The density $\pi$ is the invariant density of the chain.

(ii) The chain is reversible.

Proof. Part (i) follows by noting that, by the detailed balance condition, for any measurable set $B$,

$$
\begin{aligned}
\int_{\mathcal{Y}} K(y,B)\, \pi(y)\, dy &= \int_{\mathcal{Y}} \int_B K(y,x)\, \pi(y)\, dx\, dy \\
&= \int_{\mathcal{Y}} \int_B K(x,y)\, \pi(x)\, dx\, dy = \int_B \pi(x)\, dx ,
\end{aligned}
$$

since $\int K(x,y)\, dy = 1$. With the existence of the kernel $K$ and invariant density $\pi$, it is clear that detailed balance and reversibility are the same property. □



If $f(x,y)$ is a joint density, then we can write (with obvious notation)

$$
\begin{aligned}
f(x,y) &= f_{X|Y}(x|y)\, f_Y(y), \\
f(x,y) &= f_{Y|X}(y|x)\, f_X(x),
\end{aligned}
$$

and thus detailed balance requires that $f_X = f_Y$ and $f_{X|Y} = f_{Y|X}$, that is, there is symmetry in the conditionals and the marginals are the same.
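Detailed balance and the invariance it implies can be verified exactly on a small discrete kernel. The sketch below builds a Metropolis-type kernel on four points arranged on a ring, with a symmetric nearest-neighbour proposal and a hypothetical target `f`; this construction is chosen only because it is known to satisfy (6.22) and is not taken from the text above.

```python
from fractions import Fraction

# Hypothetical target probabilities on the states {0, 1, 2, 3}.
f = [Fraction(w, 10) for w in (1, 2, 3, 4)]
n = len(f)

def metropolis_kernel(f):
    """Propose a ring neighbour with probability 1/2 each; accept w.p. min(1, f(y)/f(x))."""
    K = [[Fraction(0)] * n for _ in range(n)]
    for x in range(n):
        for y in ((x - 1) % n, (x + 1) % n):
            K[x][y] = Fraction(1, 2) * min(Fraction(1), f[y] / f[x])
        K[x][x] = 1 - sum(K[x])          # rejected proposals stay at x
    return K

K = metropolis_kernel(f)

# Condition (6.22): K(y, x) f(y) = K(x, y) f(x) for every pair (x, y).
balanced = all(K[y][x] * f[y] == K[x][y] * f[x]
               for x in range(n) for y in range(n))
# Consequence (i) of Theorem 6.46: f is invariant for K.
invariant = all(sum(f[y] * K[y][x] for y in range(n)) == f[x] for x in range(n))
print(balanced, invariant)
```

Summing the balance equation over $y$ is exactly the computation in the proof of part (i), which is why the second check succeeds whenever the first one does.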

^ If there are no measure-theoretic difficulties with the definition of the kernel $K$, both notions are equivalent.

6.6 Ergodicity and Convergence

6.6.1 Ergodicity 

Considering the Markov chain $(X_n)$ from a temporal perspective, it is natural (and important) to establish the limiting behavior of $X_n$; that is, To what is the chain converging? The existence (and uniqueness) of an invariant distribution $\pi$ makes that distribution a natural candidate for the limiting distribution, and we now turn to finding sufficient conditions on $(X_n)$ for $X_n$ to be asymptotically distributed according to $\pi$. The following theorems are fundamental convergence results for Markov chains and they are at the core of the motivation for Markov chain Monte Carlo algorithms. They are, unfortunately, if not surprisingly, quite difficult to establish and we restrict the proof to the countable case, the extension to the general case being detailed in Meyn and Tweedie (1993, pp. 322-323).

There are many conditions that can be placed on the convergence of $P^n$, the distribution of $X_n$, to $\pi$. Perhaps, the most fundamental and important is that of ergodicity, that is, independence of initial conditions.

Definition 6.47. For a Harris positive chain $(X_n)$, with invariant distribution $\pi$, an atom $\alpha$ is ergodic if

$$
\lim_{n \to \infty} |K^n(\alpha,\alpha) - \pi(\alpha)| = 0 .
$$

In the countable case, the existence of an ergodic atom is, in fact, sufficient to establish convergence according to the total variation norm,

$$
\|\mu_1 - \mu_2\|_{TV} = \sup_A |\mu_1(A) - \mu_2(A)| .
$$

Proposition 6.48. If $(X_n)$ is Harris positive on a denumerable space $\mathcal{X}$, and if there exists an ergodic atom $\alpha \subset \mathcal{X}$, then, for every $x \in \mathcal{X}$,

$$
\lim_{n \to \infty} \|K^n(x,\cdot) - \pi\|_{TV} = 0 .
$$

Proof. The first step follows from a decomposition formula called "first entrance and last exit":

$$
\begin{aligned}
K^n(x,y) ={}& P_x(X_n = y, \tau_\alpha \ge n) \\
&+ \sum_{j=1}^{n-1} \left[ \sum_{k=1}^{j} P_x(X_k \in \alpha, \tau_\alpha = k)\, K^{j-k}(\alpha,\alpha) \right] P_\alpha(X_{n-j} = y, \tau_\alpha \ge n-j),
\end{aligned}
\tag{6.23}
$$

which relates $K^n(x,y)$ to the last visit to $\alpha$. (See Problem 6.37.) This shows the reduced influence of the initial value $x$, since $P_x(X_n = y, \tau_\alpha \ge n)$ converges to 0 with $n$. The expression (6.17) of the invariant measure implies, in addition, that

$$
\pi(y) = \pi(\alpha) \sum_{j=1}^{\infty} P_\alpha(X_j = y, \tau_\alpha \ge j) .
$$

These two expressions then lead to

$$
\begin{aligned}
\|K^n(x,\cdot) - \pi\|_{TV} ={}& \sum_y |K^n(x,y) - \pi(y)| \\
\le{}& \sum_y P_x(X_n = y, \tau_\alpha \ge n) \\
&+ \sum_y \sum_{j=1}^{n-1} \left| \sum_{k=1}^{j} P_x(X_k \in \alpha, \tau_\alpha = k)\, K^{j-k}(\alpha,\alpha) - \pi(\alpha) \right| P_\alpha(X_{n-j} = y, \tau_\alpha \ge n-j) \\
&+ \sum_y \pi(\alpha) \sum_{j=n-1}^{\infty} P_\alpha(X_j = y, \tau_\alpha \ge j) .
\end{aligned}
$$

The second step in the proof is to show that each term in the above decomposition goes to 0 as $n$ goes to infinity. The first term is actually $P_x(\tau_\alpha \ge n)$ and goes to 0 since the chain is Harris recurrent. The third term is the remainder of the convergent series

$$
\sum_y \pi(\alpha) \sum_{j=1}^{\infty} P_\alpha(X_j = y, \tau_\alpha \ge j) = \sum_y \pi(y) .
\tag{6.24}
$$

The middle term is the sum over the $y$'s of the convolution of the two sequences $a_n = \left| \sum_{k=1}^{n} P_x(X_k \in \alpha, \tau_\alpha = k)\, K^{n-k}(\alpha,\alpha) - \pi(\alpha) \right|$ and $b_n = P_\alpha(X_n = y, \tau_\alpha \ge n)$. The sequence $(a_n)$ is converging to 0 since the atom $\alpha$ is ergodic and the series of the $b_n$'s is convergent, as mentioned. An algebraic argument (see Problem 6.39) then implies that the middle term goes to 0 as $n$ goes to $\infty$. □
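The convergence in total variation stated in Proposition 6.48 is easy to observe on a finite chain. The sketch below assumes a hypothetical 3-state aperiodic kernel whose stationary distribution $(1/4, 1/2, 1/4)$ is known in closed form, and uses the supremum form of the norm, which on a discrete space equals half the sum of the absolute differences.

```python
# Hypothetical aperiodic 3-state chain with stationary distribution (1/4, 1/2, 1/4).
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
pi = [0.25, 0.5, 0.25]
n = len(P)

def tv(mu1, mu2):
    """sup_A |mu1(A) - mu2(A)|, computed as half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu1, mu2))

def tv_path(x, n_steps):
    """Distances ||K^t(x, .) - pi||_TV for t = 1, ..., n_steps."""
    dist = [1.0 if y == x else 0.0 for y in range(n)]
    out = []
    for _ in range(n_steps):
        dist = [sum(dist[z] * P[z][y] for z in range(n)) for y in range(n)]
        out.append(tv(dist, pi))
    return out

for x in range(n):
    path = tv_path(x, 30)
    print(x, path[0], path[4], path[-1])
```

Whatever the starting state, the distance decreases to 0, here geometrically fast; the rate for this matrix is governed by its second largest eigenvalue.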

The decomposition (6.23) is quite revealing in that it shows the role of the 
atom a as the generator of a renewal process. Below, we develop an exten- 
sion which allows us to deal with the general case using coupling techniques. 
(These techniques are also useful in the assessment of convergence for Markov 
chain Monte Carlo algorithms.) Lindvall (1992) provides an introduction to 
coupling. 

The coupling principle uses two chains $(X_n)$ and $(X'_n)$ associated with the same kernel, the “coupling” event taking place when they meet in $\alpha$; that is, at the first time $n_0$ such that $X_{n_0} \in \alpha$ and $X'_{n_0} \in \alpha$. After this instant, the probabilistic properties of $(X_n)$ and $(X'_n)$ are identical and if one of the two chains is stationary, there is no longer any dependence on initial conditions for either chain. Therefore, if we can show that the coupling time (that is, the time it takes for the two chains to meet) is finite for almost every starting point, the ergodicity of the chain follows.
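
The coupling idea can be tried out on a small example. The sketch below is only an illustration under assumptions that are not in the text: a three-state transition matrix `P` chosen for convenience, state 1 playing the role of the atom, and two independent copies of the chain started from different states. The helper names `step` and `coupling_time` are hypothetical.

```python
import random

def step(P, state, rng):
    """Draw the next state from row `state` of the transition matrix P."""
    u, acc = rng.random(), 0.0
    for nxt, p in enumerate(P[state]):
        acc += p
        if u < acc:
            return nxt
    return len(P[state]) - 1  # guard against floating-point rounding

def coupling_time(P, x0, y0, atom, rng, max_iter=10_000):
    """First time n >= 1 at which both independent chains sit in `atom`."""
    x, y = x0, y0
    for n in range(1, max_iter + 1):
        x, y = step(P, x, rng), step(P, y, rng)
        if x == atom and y == atom:
            return n
    return None  # no coupling observed within max_iter steps

# Illustrative (assumed) kernel: every state reaches state 1 w.p. >= 0.5.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
rng = random.Random(0)
times = [coupling_time(P, 0, 2, atom=1, rng=rng) for _ in range(2000)]
mean_time = sum(times) / len(times)
```

For this particular kernel both chains enter state 1 simultaneously with probability at least 0.25 at every step, so the simulated coupling times are finite and short, which is the behavior the argument above requires.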

For a recurrent atom $\alpha$ on a denumerable space $\mathcal{X}$, let $\tau_\alpha(k)$ denote the $k$th visit to $\alpha$ ($k = 1, 2, \ldots$), and let $p = (p(1), p(2), \ldots)$ be the distribution of the excursion time

\[ S_k = \tau_\alpha(k+1) - \tau_\alpha(k), \]

between two visits to $\alpha$. If $q = (q(0), q(1), \ldots)$ represents the distribution of $\tau_\alpha(1)$ (which depends on the initial condition, $x_0$ or $\mu$), then the distribution of $\tau_\alpha(n+1)$ is given by the convolution product $q \star p^{n\star}$ (that is, the distribution of the sum of $n$ iid rv's distributed from $p$ and of a variable distributed from $q$), since

\[ \tau_\alpha(n+1) = S_n + \cdots + S_1 + \tau_\alpha(1). \]

Thus, consider two sequences $(S_i)$ and $(S'_i)$ such that $S_1, S_2, \ldots$ and $S'_1, S'_2, \ldots$ are iid from $p$ with $S_0 \sim q$ and $S'_0 \sim r$. We introduce the indicator functions

\[ Z_q(n) = \sum_{j=0}^{n} \mathbb{I}_{S_0 + \cdots + S_j = n} \quad \text{and} \quad Z_r(n) = \sum_{j=0}^{n} \mathbb{I}_{S'_0 + \cdots + S'_j = n}, \]

which correspond to the events that the chains $(X_n)$ and $(X'_n)$ visit $\alpha$ at time $n$. The coupling time is then given by

\[ T_{qr} = \min \{ j ;\, Z_q(j) = Z_r(j) = 1 \}, \]

which satisfies the following lemma, whose proof can be found in Problem 6.45.

Lemma 6.49. If the mean excursion time satisfies

\[ m_p = \sum_{n=0}^{\infty} n\, p(n) < \infty \]

and if $p$ is aperiodic (the g.c.d. of the support of $p$ is 1), then the coupling time $T_{pq}$ is almost surely finite, that is,

\[ P(T_{pq} < \infty) = 1, \]

for every $q$.

If $p$ is aperiodic with finite mean $m_p$, this implies that $Z_q$ satisfies

\[ \lim_{n \to \infty} \left| P(Z_q(n) = 1) - m_p^{-1} \right| = 0, \tag{6.25} \]

as shown in Problem 6.42. The probability of visiting $\alpha$ at time $n$ is thus asymptotically independent of the initial distribution and this result implies that Proposition 6.48 holds without imposing constraints in the discrete case.

Theorem 6.50. For a positive recurrent aperiodic Markov chain on a countable space, for every initial state $x$,

\[ \lim_{n \to \infty} \|K^n(x,\cdot) - \pi\|_{TV} = 0. \]

Proof Since $(X_n)$ is positive recurrent, $E_\alpha[\tau_\alpha]$ is finite by Theorem 6.37. Therefore, $m_p$ is finite, (6.25) holds, and every atom is ergodic. The result follows from Proposition 6.48. □



For general state-spaces $\mathcal{X}$, Harris recurrence is nonetheless necessary in the derivation of the convergence of $K^n$ to $\pi$. (Note that another characterization of Harris recurrence is the convergence of $\|K^n(x,\cdot) - \pi\|_{TV}$ to 0 for every value $x$, instead of almost every value.)

Theorem 6.51. If $(X_n)$ is Harris positive and aperiodic, then

\[ \lim_{n \to \infty} \left\| \int K^n(x,\cdot)\, \mu(dx) - \pi \right\|_{TV} = 0 \]

for every initial distribution $\mu$.



This result follows from an extension of the denumerable case to strongly 
aperiodic Harris positive chains by splitting, since these chains always allow 
for small sets (see Section 6.3.3), based on an equivalent to the “first entrance 
and last exit” formula (6.23). It is then possible to move to arbitrary chains 
by the following result. 



Proposition 6.52. If $\pi$ is an invariant distribution for $P$, then

\[ \left\| \int K^n(x,\cdot)\, \mu(dx) - \pi \right\|_{TV} \]

is decreasing in $n$.

Proof First, note the equivalent definition of the norm (Problem 6.40)

\[ \|\mu\|_{TV} = \frac{1}{2} \sup_{|h| \le 1} \left| \int h(x)\, \mu(dx) \right| . \tag{6.26} \]

We then have

\[
\begin{aligned}
2 \left\| \int K^{n+1}(x,\cdot)\, \mu(dx) - \pi \right\|_{TV}
&= \sup_{|h| \le 1} \left| \int h(y) K^{n+1}(x,dy)\, \mu(dx) - \int h(y)\, \pi(dy) \right| \\
&= \sup_{|h| \le 1} \left| \int h(y) \int K^{n}(x,dw) K(w,dy)\, \mu(dx) - \int h(y) \int K(w,dy)\, \pi(dw) \right| ,
\end{aligned}
\]

since, by definition, $K^{n+1}(x,dy) = \int K^n(x,dw) K(w,dy)$ and, by the invariance of $\pi$, $\pi(dy) = \int K(w,dy)\, \pi(dw)$. Regrouping terms, we can write

\[
\begin{aligned}
2 \left\| \int K^{n+1}(x,\cdot)\, \mu(dx) - \pi \right\|_{TV}
&= \sup_{|h| \le 1} \left| \int \left[ \int h(y) K(w,dy) \right] K^n(x,dw)\, \mu(dx) - \int \left[ \int h(y) K(w,dy) \right] \pi(dw) \right| \\
&\le \sup_{|h| \le 1} \left| \int h(w) K^n(x,dw)\, \mu(dx) - \int h(w)\, \pi(dw) \right| ,
\end{aligned}
\]

where the inequality follows from the fact that the quantity in square brackets is a function with norm less than 1. Hence, monotonicity of the total variation norm is established. □
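
The monotonicity can be observed numerically on a finite state space, where the integrals reduce to vector-matrix products. The following is a minimal sketch, assuming an illustrative three-state kernel `P` (not taken from the text) whose invariant distribution is `pi = (0.25, 0.5, 0.25)`; the total variation distance uses the 1/2 convention of (6.26).

```python
def mat_vec_left(mu, P):
    """Return the row vector mu P, i.e. the law of X_{n+1} given X_n ~ mu."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

def tv(mu, pi):
    """Total variation distance with the 1/2 convention of (6.26)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, pi))

# Illustrative (assumed) kernel and its invariant distribution.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]   # satisfies pi = pi P for this P
mu = [1.0, 0.0, 0.0]     # chain started at state 0

dists = []
for n in range(15):
    dists.append(tv(mu, pi))   # distance between the law of X_n and pi
    mu = mat_vec_left(mu, P)
```

For this kernel the recorded distances never increase and shrink geometrically, in line with Proposition 6.52 and with the convergence stated in Theorem 6.50.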



Note that the equivalence (6.26) also implies the convergence

\[ \lim_{n \to \infty} \left| E_\mu[h(X_n)] - E^\pi[h(X)] \right| = 0 \tag{6.27} \]


for every bounded function h. This equivalence is, in fact, often taken as the 
defining condition for convergence of distributions (see, for example, Billings- 
ley 1995, Theorem 25.8). We can, however, conclude (6.27) from a slightly 
weaker set of assumptions, where we do not need the full force of Harris re- 
currence (see Theorem 6.80 for an example). 

The extension of (6.27) to more general functions h is called h-ergodicity 
by Meyn and Tweedie (1993, pp. 342-344). 



Theorem 6.53. Let $(X_n)$ be positive, recurrent, and aperiodic.

(a) If $E^\pi[|h(X)|] = \infty$, $E_x[|h(X_n)|] \to \infty$ for every $x$.

(b) If $\int |h(x)|\, \pi(dx) < \infty$, then

\[ \lim_{n \to \infty} \sup_{|m(x)| \le |h(x)|} \left| E_y[m(X_n)] - E^\pi[m(X)] \right| = 0 \tag{6.28} \]

on all small sets $C$ such that

\[ \sup_{y \in C} E_y \left[ \sum_{t=0}^{\tau_C - 1} h(X_t) \right] < \infty . \tag{6.29} \]

Similar conditions appear as necessary conditions for the Central Limit Theorem (see (6.31) in Theorem 6.64). Condition (6.29) relates to a coupling argument, in the sense that the influence of the initial condition vanishes “fast enough,” as in the proof of Theorem 6.63.

6.6.2 Geometric Convergence



The convergence (6.28) of the expectation of $h(X_n)$ at time $n$ to the expectation of $h(X)$ under the stationary distribution $\pi$ somehow ensures the proper behavior of the chain $(X_n)$ whatever the initial value $X_0$ (or its distribution). A more precise description of convergence properties involves the study of the speed of convergence of $K^n$ to $\pi$. An evaluation of this speed is important for Markov chain Monte Carlo algorithms in the sense that it relates to stopping rules for these algorithms; minimal convergence speed is also a requirement for the application of the Central Limit Theorem.

To study the speed of convergence more closely, we first introduce an extension of the total variation norm, denoted by $\|\cdot\|_h$, which allows for an upper bound other than 1 on the functions. The generalization is defined by

\[ \|\mu\|_h = \sup_{|g| \le h} \left| \int g(x)\, \mu(dx) \right| . \]



Definition 6.54. A chain $(X_n)$ is geometrically $h$-ergodic, with $h \ge 1$ on $\mathcal{X}$, if $(X_n)$ is Harris positive, with stationary distribution $\pi$, if $(X_n)$ satisfies $E^\pi[h] < \infty$, and if there exists $r_h > 1$ such that

\[ \sum_{n=1}^{\infty} r_h^n \|K^n(x,\cdot) - \pi\|_h < \infty \tag{6.30} \]

for every $x \in \mathcal{X}$. The case $h = 1$ corresponds to the geometric ergodicity of $(X_n)$.

Geometric $h$-ergodicity means that $\|K^n(x,\cdot) - \pi\|_h$ is decreasing at least at a geometric speed, since (6.30) implies

\[ \|K^n(x,\cdot) - \pi\|_h \le M r_h^{-n} \]

with

\[ M = \sum_{n=1}^{\infty} r_h^n \|K^n(x,\cdot) - \pi\|_h . \]

If $(X_n)$ has an atom $\alpha$, (6.30) implies that for a real number $r > 1$,

\[ E_\alpha \left[ \sum_{n=1}^{\tau_\alpha} |h(X_n)|\, r^n \right] < \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \left| P_\alpha(X_n \in \alpha) - \pi(\alpha) \right| r^n < \infty . \]

The series associated with $|P_\alpha(X_n \in \alpha) - \pi(\alpha)|\, r^n$ converges outside of the unit circle if the power series associated with $P_\alpha(\tau_\alpha = n)$ converges for values of $|r|$ strictly larger than 1. (The proof of this result, called Kendall's Theorem, is based on the renewal equations established in the proof of Proposition 6.31.) This equivalence justifies the following definition.

Definition 6.55. An accessible atom $\alpha$ is geometrically ergodic if there exists $r > 1$ such that

\[ \sum_{n=1}^{\infty} \left| K^n(\alpha,\alpha) - \pi(\alpha) \right| r^n < \infty \]

and $\alpha$ is a Kendall atom if there exists $\kappa > 1$ such that

\[ E_\alpha[\kappa^{\tau_\alpha}] < \infty . \]

If $\alpha$ is a Kendall atom, it is thus geometrically ergodic and ensures geometric ergodicity for $(X_n)$:

Theorem 6.56. If $(X_n)$ is $\psi$-irreducible, with invariant distribution $\pi$, and if there exists a geometrically ergodic atom $\alpha$, then there exist $r > 1$, $\kappa > 1$, and $R < \infty$ such that, for almost every $x \in \mathcal{X}$,

\[ \sum_{n=1}^{\infty} r^n \|K^n(x,\cdot) - \pi\|_{TV} < R\, E_x[\kappa^{\tau_\alpha}] < \infty . \]

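Example 6.57. Nongeometric returns to 0. For a chain on $\mathbb{Z}_+$ with transition matrix $P = (p_{ij})$ such that

\[ p_{0j} = \gamma_j, \quad p_{jj} = \beta_j, \quad p_{j0} = 1 - \beta_j, \quad \sum_j \gamma_j = 1, \]

Meyn and Tweedie (1993, p. 361) consider the return time to 0, $\tau_0$, with mean

\[ E_0[\tau_0] = \sum_j \gamma_j \left\{ (1 - \beta_j) + 2 \beta_j (1 - \beta_j) + \cdots \right\} = \sum_j \gamma_j (1 - \beta_j)^{-1} . \]

The state 0 is thus an ergodic atom when all the $\gamma_j$'s are positive (yielding irreducibility) and $\sum_j \gamma_j (1 - \beta_j)^{-1} < \infty$. Now, for $r > 0$,

\[ E_0[r^{\tau_0}] = r \sum_j \gamma_j E_j[r^{\tau_0 - 1}] = r \sum_j \gamma_j \sum_{k=0}^{\infty} r^k \beta_j^k (1 - \beta_j) . \]

For $r > 1$, if $\beta_j \to 1$ as $j \to \infty$, the series in the above expectation always diverges for $j$ large enough. Thus, the chain is not geometrically ergodic. ||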



6.6.3 Uniform Ergodicity 

The property of uniform ergodicity is stronger than geometric ergodicity in 
the sense that the rate of geometric convergence must be uniform over the 
whole space. It is used in the Central Limit Theorem given in Section 6.7.

Definition 6.58. The chain $(X_n)$ is uniformly ergodic if

\[ \lim_{n \to \infty} \sup_{x \in \mathcal{X}} \|K^n(x,\cdot) - \pi\|_{TV} = 0 . \]

Uniform ergodicity can be established through one of the following equivalent 
properties: 

Theorem 6.59. The following conditions are equivalent:

(a) $(X_n)$ is uniformly ergodic;

(b) there exist $R < \infty$ and $r > 1$ such that

\[ \|K^n(x,\cdot) - \pi\|_{TV} < R r^{-n}, \quad \text{for all } x \in \mathcal{X}; \]

(c) $(X_n)$ is aperiodic and $\mathcal{X}$ is a small set;

(d) $(X_n)$ is aperiodic and there exist a small set $C$ and a real $\kappa > 1$ such that

\[ \sup_{x \in \mathcal{X}} E_x[\kappa^{\tau_C}] < \infty . \]

If the whole space $\mathcal{X}$ is small, there exist a probability distribution, $\varphi$, on $\mathcal{X}$, and constants $\epsilon < 1$, $\delta > 0$, and $n$ such that, if $\varphi(A) > \epsilon$ then

\[ \inf_{x \in \mathcal{X}} K^n(x,A) > \delta . \]

This property is sometimes called Doeblin's condition. This requirement shows the strength of uniform ergodicity and suggests that it can be difficult to verify. We will still see examples of Markov chain Monte Carlo algorithms which achieve this superior form of ergodicity (see Example 10.17). Note, moreover, that in the finite case, uniform ergodicity can be derived from the smallness of $\mathcal{X}$ since the condition

\[ P(X_{n+1} = y \mid X_n = x) \ge \inf_z p_{zy} = \rho_y \quad \text{for every } x, y \in \mathcal{X}, \]

leads to the choice of the minorizing measure $\nu$ as

\[ \nu(y) = \frac{\rho_y}{\sum_{z \in \mathcal{X}} \rho_z} \]

as long as $\rho_y > 0$ for some $y \in \mathcal{X}$. (If $(X_n)$ is recurrent and aperiodic, this positivity condition can be attained by a subchain $(Y_m) = (X_{md})$ for $d$ large enough. See Meyn and Tweedie 1993, Chapter 16, for more details.)
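
In the finite case the minorizing measure can be computed by taking column minima of the transition matrix. The sketch below assumes an illustrative three-state matrix `P` (not from the text) and a hypothetical helper `minorization`; it returns the constant $\epsilon = \sum_z \rho_z$ together with the normalized measure $\nu$, so that every row of `P` dominates $\epsilon\, \nu$.

```python
def minorization(P):
    """Return (eps, nu) such that P[x][y] >= eps * nu[y] for all x, y."""
    n = len(P)
    # rho[y] is the smallest entry of column y, i.e. inf over x of p_xy.
    rho = [min(P[x][y] for x in range(n)) for y in range(n)]
    eps = sum(rho)
    if eps == 0:
        raise ValueError("every column has a zero entry; try a power of P")
    return eps, [r / eps for r in rho]

# Illustrative (assumed) transition matrix with strictly positive entries.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
eps, nu = minorization(P)
```

When some column minimum is zero for `P` itself, the same computation can be applied to a power of `P`, which corresponds to the subchain argument mentioned above.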



6.7 Limit Theorems 

Although the notions and results introduced in the previous sections are important in justifying Markov chain Monte Carlo algorithms, in the following chapters we will see that this last section is essential to the processing of these algorithms. In fact, the different convergence results (ergodicity) obtained in Section 6.6 deal only with the probability measure $P_x^n$ (through different norms), which is somewhat of a “snapshot” of the chain $(X_n)$ at time $n$. So, it determines the probabilistic properties of average behavior of the chain at a fixed instant. Such properties, even though they provide justification for the simulation methods, are of lesser importance for the control of convergence of a given simulation, where the properties of the realization $(x_n)$ of the chain are the only characteristics that truly matter. (Meyn and Tweedie 1993 call this type of properties “sample path” properties.)

We are thus led back to some basic ideas, previously discussed in a statistical setup by Robert (2001, Chapters 1 and 11); that is, we must consider the difference between probabilistic analysis, which describes the average behavior of samples, and statistical inference, which must reason by induction from the observed sample. While probabilistic properties can justify or refute some statistical approaches, this does not contradict the fact that statistical analysis must be done conditional on the observed sample. Such a consideration can lead to the Bayesian approach in a statistical setup (or at least to consideration of the Likelihood Principle; see, e.g., Berger and Wolpert 1988, or Robert 2001, Section 1.3). In the setup of Markov chains, a conditional analysis can take advantage of convergence properties of $P_x^n$ to $\pi$ only to verify the convergence, to a quantity of interest, of functions of the observed path of the chain. Indeed, the fact that $\|P_x^n - \pi\|$ is close to 0, or even converges geometrically fast to 0 with speed $\rho^n$ ($0 < \rho < 1$), does not bring direct information about the unique available observation from $P_x^n$, namely $X_n$.

The problems in directly applying the classical convergence theorems (Law of Large Numbers, Law of the Iterated Logarithm, Central Limit Theorem, etc.) to the sample $(X_1, \ldots, X_n)$ are due both to the Markovian dependence structure between the observations $X_i$ and to the non-stationarity of the sequence. (Only if $X_0 \sim \pi$, the stationary distribution of the chain, will the chain be stationary. Since this is equivalent to integrating over the initial conditions, it eliminates the need for a conditional analysis. Such an occurrence, especially in Markov chain Monte Carlo, is somewhat rare.*)

We therefore assume that the chain is started from a point $X_0$ whose distribution is not the stationary distribution of the chain, and thus we deal with non-stationary chains directly. We begin with a detailed presentation of convergence results equivalent to the Law of Large Numbers, which are often called ergodic theorems. We then mention in Section 6.7.2 various versions of the Central Limit Theorem whose assumptions are usually (and unfortunately) difficult to check.

* Nonetheless, there is considerable research in MCMC theory about perfect simulation; that is, ways of starting the algorithm with $X_0 \sim \pi$. See Chapter 13.

6.7.1 Ergodic Theorems

Given observations $X_1, \ldots, X_n$ of a Markov chain, we now examine the limiting behavior of the partial sums

\[ S_n(h) = \frac{1}{n} \sum_{i=1}^{n} h(X_i) \]

when $n$ goes to infinity, getting back to the iid case through renewal when $(X_n)$ has an atom. Consider first the notion of harmonic functions, which is related to ergodicity for Harris recurrent Markov chains.

Definition 6.60. A measurable function $h$ is harmonic for the chain $(X_n)$ if

\[ E[h(X_{n+1}) \mid x_n] = h(x_n). \]

These functions are invariant for the transition kernel (in the functional 
sense) and they characterize Harris recurrence as follows. 

Proposition 6.61. For a positive Markov chain, if the only bounded har- 
monic functions are the constant functions, the chain is Harris recurrent. 

Proof First, the probability of an infinite number of returns, $Q(x,A) = P_x(\eta_A = \infty)$, as a function of $x$, $h(x)$, is clearly a harmonic function. This is because

\[ E_y[h(X_1)] = E_y[P_{X_1}(\eta_A = \infty)] = P_y(\eta_A = \infty), \]

and thus, since bounded harmonic functions are constant by assumption, $Q(x,A)$ is constant (in $x$).

The function $Q(x,A)$ describes a tail event, an event whose occurrence does not depend on $X_1, X_2, \ldots, X_m$, for any finite $m$. Such events generally obey a 0-1 law, that is, their probabilities of occurrence are either 0 or 1. However, 0-1 laws are typically established in the independence case, and, unfortunately, extensions to cover Markov chains are beyond our scope. (For example, see the Hewitt-Savage 0-1 Law, in Billingsley 1995, Section 36.) For the sake of our proof, we will just state that $Q(x,A)$ obeys a 0-1 Law and proceed.

If $\pi$ is the invariant measure and $\pi(A) > 0$, the case $Q(x,A) = 0$ is impossible. To see this, suppose that $Q(x,A) = 0$. It then follows that the chain almost surely visits $A$ only a finite number of times and the average

\[ \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}_A(X_i) \]

will not converge to $\pi(A)$, contradicting the Law of Large Numbers (see Theorem 6.63). Thus, for any $x$, $Q(x,A) = 1$, establishing that the chain is a Harris chain. □

Proposition 6.61 can be interpreted as a continuity property of the transition functional $Kh(x) = E_x[h(X_1)]$ in the following sense. By induction, a harmonic function $h$ satisfies $h(x) = E_x[h(X_n)]$ and by virtue of Theorem 6.53, $h(x)$ is almost surely equal to $E^\pi[h(X)]$; that is, it is constant almost everywhere. For Harris recurrent chains, Proposition 6.61 states that this implies $h(x)$ is constant everywhere. (Feller 1971, pp. 265-267, develops a related approach to ergodicity, where Harris recurrence is replaced by a regularity constraint on the kernel.)

Proposition 6.61 will be most useful in establishing Harris recurrence of 
some Markov chain Monte Carlo algorithms. Interestingly, the behavior of 
bounded harmonic functions characterizes Harris recurrence, as the converse 
of Proposition 6.61 is true. We state it without its rather difficult proof (see 
Meyn and Tweedie 1993, p. 415). 

Lemma 6.62. For Harris recurrent Markov chains, the constants are the only 
bounded harmonic functions. 
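
On a finite state space, the harmonic property of Definition 6.60 reads $Kh = h$ and can be checked by a matrix-vector product. The sketch below is an illustration under assumed examples that are not in the text: a symmetric walk on $\{0,\ldots,4\}$ absorbed at both ends (not irreducible, hence not Harris recurrent), for which $h(x) = x/4$ is a bounded non-constant harmonic function, and the same walk with reflecting ends, for which this $h$ is no longer harmonic while constants still are.

```python
def apply_kernel(P, h):
    """Return Kh, where Kh(x) = sum_y P[x][y] * h(y) = E_x[h(X_1)]."""
    return [sum(p * hy for p, hy in zip(row, h)) for row in P]

def is_harmonic(P, h, tol=1e-12):
    """Check Kh = h entrywise, up to a numerical tolerance."""
    return all(abs(a - b) <= tol for a, b in zip(apply_kernel(P, h), h))

# Symmetric walk on {0,...,4}, absorbed at 0 and 4 (assumed example).
absorbed = [[1, 0, 0, 0, 0],
            [0.5, 0, 0.5, 0, 0],
            [0, 0.5, 0, 0.5, 0],
            [0, 0, 0.5, 0, 0.5],
            [0, 0, 0, 0, 1]]
h = [x / 4 for x in range(5)]   # bounded and non-constant

# Same walk with reflection at the two ends (irreducible on a finite space).
reflected = [[0.5, 0.5, 0, 0, 0]] + absorbed[1:4] + [[0, 0, 0, 0.5, 0.5]]

harmonic_absorbed = is_harmonic(absorbed, h)
harmonic_reflected = is_harmonic(reflected, h)
constant_reflected = is_harmonic(reflected, [1.0] * 5)
```

The existence of the non-constant bounded harmonic function for the absorbed walk is consistent with Lemma 6.62, since that chain is not Harris recurrent.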

A consequence of Lemma 6.62 is that if $(X_n)$ is Harris positive with stationary distribution $\pi$ and if $S_n(h)$ converges $\mu_0$-almost surely ($\mu_0$ a.s.) to

\[ \int_{\mathcal{X}} h(x)\, \pi(dx), \]

for an initial distribution $\mu_0$, this convergence occurs for every initial distribution $\mu$. Indeed, the convergence probability

\[ P_x\left( S_N(h) \to E^\pi[h] \right) \]

is then harmonic. Once again, this shows that Harris recurrence is a superior 
type of stability in the sense that almost sure convergence is replaced by 
convergence at every point. 

Of course, we now know that if bounded functions other than the constants are harmonic, the chain is not Harris recurrent. This is looked at in detail in Problem 6.59.

The main result of this section, namely the Law of Large Numbers for 
Markov chains (which is customarily called the Ergodic Theorem) , guarantees 
the convergence of Sn{h). 
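
As an informal illustration of this convergence, the sketch below simulates a two-state chain and computes $S_n(h)$ for $h$ the indicator of state 1, from both possible starting states. The transition probabilities `a` and `b`, the seed, and the helper name `ergodic_average` are assumptions made for the example; for these values the stationary probability of state 1 is $a/(a+b) = 0.6$.

```python
import random

def ergodic_average(a, b, x0, n, rng):
    """S_n(h) for h = indicator of state 1 on the two-state chain with
    P(0 -> 1) = a and P(1 -> 0) = b, started at x0."""
    x, total = x0, 0
    for _ in range(n):
        if x == 0:
            x = 1 if rng.random() < a else 0
        else:
            x = 0 if rng.random() < b else 1
        total += x          # h(X_i) = 1 exactly when X_i = 1
    return total / n

rng = random.Random(1)
a, b = 0.3, 0.2             # assumed illustrative values
avg_from_0 = ergodic_average(a, b, 0, 200_000, rng)
avg_from_1 = ergodic_average(a, b, 1, 200_000, rng)
```

Both averages settle near 0.6 regardless of the starting state, which is the sample-path behavior that the theorem below makes precise.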

Theorem 6.63. Ergodic Theorem If $(X_n)$ has a $\sigma$-finite invariant measure $\pi$, the following two statements are equivalent:

(i) If $f, g \in L^1(\pi)$ with $\int g(x)\, d\pi(x) \ne 0$, then

\[ \lim_{n \to \infty} \frac{S_n(f)}{S_n(g)} = \frac{\int f(x)\, d\pi(x)}{\int g(x)\, d\pi(x)} . \]

(ii) The Markov chain $(X_n)$ is Harris recurrent.

Proof. If (i) holds, take $f$ to be the indicator function of a set $A$ with finite measure and $g$ an arbitrary function with finite and positive integral. If $\pi(A) > 0$,

\[ P_x(X_n \in A \text{ infinitely often}) = 1 \]

for every $x \in \mathcal{X}$, which establishes Harris recurrence.

If (ii) holds, we need only to consider the atomic case by a splitting argument. Let $\alpha$ be an atom and $\tau_\alpha(k)$ be the time of the $(k+1)$th visit to $\alpha$. If $\ell_N$ is the number of visits to $\alpha$ at time $N$, we get the bounds

\[ \sum_{j=0}^{\ell_N - 1} \sum_{n=\tau_\alpha(j)+1}^{\tau_\alpha(j+1)} f(x_n) \le \sum_{k=1}^{N} f(x_k) \le \sum_{j=0}^{\ell_N} \sum_{n=\tau_\alpha(j)+1}^{\tau_\alpha(j+1)} f(x_n) + \sum_{k=1}^{\tau_\alpha(0)} f(x_k) . \]

The blocks

\[ S_j(f) = \sum_{n=\tau_\alpha(j)+1}^{\tau_\alpha(j+1)} f(x_n) \]

are independent and identically distributed. Therefore,

\[ \frac{\sum_{i=1}^{N} f(x_i)}{\sum_{i=1}^{N} g(x_i)} \le \frac{\ell_N}{\ell_N - 1} \, \frac{\left( \sum_{j=0}^{\ell_N} S_j(f) + \sum_{k=1}^{\tau_\alpha(0)} f(x_k) \right) / \ell_N}{\sum_{j=0}^{\ell_N - 1} S_j(g) / (\ell_N - 1)} . \]

The theorem then follows by an application of the strong Law of Large Numbers for iid rv's. □

An important aspect of Theorem 6.63 is that $\pi$ does not need to be a probability measure and, therefore, that there can be some type of strong stability even if the chain is null recurrent. In the setup of a Markov chain Monte Carlo algorithm, this result is sometimes invoked to justify the use of improper posterior measures, although we fail to see the relevance of this kind of argument (see Section 10.4.3).

6.7.2 Central Limit Theorems

There is a natural progression from the Law of Large Numbers to the Central Limit Theorem. Moreover, the proof of Theorem 6.63 suggests that there is a direct extension of the Central Limit Theorem for iid variables. Unfortunately this is not the case, as conditions on the finiteness of the variance explicitly involve the atom $\alpha$ of the split chain. Therefore, we provide alternative conditions for the Central Limit Theorem to apply in different settings.


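6.7.2.1 The Discrete Case

The discrete case can be solved directly, as shown by Problems 6.50 and 6.51.

Theorem 6.64. If $(X_n)$ is Harris positive with an atom $\alpha$ such that

\[ E_\alpha[\tau_\alpha^2] < \infty, \qquad E_\alpha\left[ \left( \sum_{n=1}^{\tau_\alpha} |h(X_n)| \right)^2 \right] < \infty \tag{6.31} \]

and

\[ \gamma_h^2 = \pi(\alpha)\, E_\alpha\left[ \left( \sum_{n=1}^{\tau_\alpha} \{ h(X_n) - E^\pi[h] \} \right)^2 \right] > 0, \]

the Central Limit Theorem applies; that is,

\[ \frac{1}{\sqrt{N}} \sum_{n=1}^{N} \left( h(X_n) - E^\pi[h] \right) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \gamma_h^2) . \]

Proof Using the same notation as in the proof of Theorem 6.63, if $\bar{h}$ denotes $h - E^\pi[h]$, we get

\[ \frac{1}{\sqrt{\ell_N}} \sum_{i=1}^{\ell_N} S_i(\bar{h}) \xrightarrow{\mathcal{L}} \mathcal{N}\left( 0, E_\alpha\left[ \left( \sum_{n=1}^{\tau_\alpha} \bar{h}(X_n) \right)^2 \right] \right), \]

following from the Central Limit Theorem for the independent variables $S_i(\bar{h})$, while $N/\ell_N$ converges a.s. to $E_\alpha[S_0(1)] = 1/\pi(\alpha)$. Since

\[ \left| \sum_{i=1}^{\ell_N - 1} S_i(\bar{h}) - \sum_{k=1}^{N} \bar{h}(X_k) \right| \le S_{\ell_N}(|\bar{h}|) \]

and

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} S_i(|\bar{h}|)^2 = E_\alpha[S_0(|\bar{h}|)^2], \]

we get

\[ \limsup_{N \to \infty} \frac{S_{\ell_N}(|\bar{h}|)}{\sqrt{N}} = 0, \]

and the remainder goes to 0 almost surely. □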



This result indicates that an extension of the Central Limit Theorem to the nonatomic case will be more delicate than for the Ergodic Theorem: Conditions (6.31) are indeed expressed in terms of the split chain $(\check{X}_n)$. (See Section 12.2.3 for an extension to cases when there exists a small set.) In Note 6.9.1, we present some alternative versions of the Central Limit Theorem involving a drift condition.

6.7.2.2 Reversibility

The following theorem avoids the verification of a drift condition, but rather requires the Markov chain to be reversible (see Definition 6.44).

With the assumption of reversibility, this Central Limit Theorem directly follows from the strict positivity of $\gamma_g^2$. This was established by Kipnis and Varadhan (1986) using a proof that is beyond our reach.

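Theorem 6.65. If $(X_n)$ is aperiodic, irreducible, and reversible with invariant distribution $\pi$, the Central Limit Theorem applies when

\[ 0 < \gamma_g^2 = E^\pi[\bar{g}^2(X_0)] + 2 \sum_{k=1}^{\infty} E^\pi[\bar{g}(X_0)\, \bar{g}(X_k)] < +\infty, \]

where $\bar{g} = g - E^\pi[g]$, as in the proof of Theorem 6.64.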

The main point here is that even though reversibility is a very restrictive 
assumption in general, it is often easy to impose in Markov chain Monte 
Carlo algorithms by introducing additional simulation steps (see Geyer 1992, 
Tierney 1994, Green 1995). See also Theorem 6.77 for another version of the 
Central Limit Theorem, which relies on a “drift condition” (see Note 6.9.1) 
similar to geometric ergodicity. 

Example 6.66 (Continuation of Example 6.43). For the AR(1) chain, the transition kernel corresponds to the $\mathcal{N}(\theta x_{n-1}, \sigma^2)$ distribution, and the stationary distribution is $\mathcal{N}(0, \sigma^2/(1 - \theta^2))$. It is straightforward to verify that the chain is reversible by showing that (Problem 6.65)

\[ X_{n+1} \mid X_n \sim \mathcal{N}(\theta x_n, \sigma^2) \quad \text{and} \quad X_n \mid X_{n+1} \sim \mathcal{N}(\theta x_{n+1}, \sigma^2). \]

Thus the chain satisfies the conditions for the CLT.

Figure 6.2 shows histograms of means for the cases of $\theta = .5$ and $\theta = 2$. In the first case (left panel) we have a positive recurrent chain that satisfies the conditions of the CLT. The right panel is most interesting, however, because $\theta = 2$ and the chain is transient. Nevertheless, the histogram of the means “looks” quite well behaved, giving no sign that the chain is not converging.

Null recurrent and transient chains can often look well behaved when examined graphically through some output. However, another picture shows a different story. In Figure 6.3 we look at the trajectories of the cumulative mean and standard deviation from one chain of length 1000. There, the left panel corresponds to the ergodic case with $\theta = .5$, and the right panel corresponds to the (barely) transient case of $\theta = 1.0001$. In the latter case, it is clear that there is no convergence. See Section 10.4.3 for the manifestation of this in MCMC algorithms. ||

Fig. 6.2. Histogram of 2500 means (each based on 50 observations) from an AR(1) chain. The left panel corresponds to $\theta = .5$, which results in an ergodic chain. The right panel corresponds to $\theta = 2$, which corresponds to a transient chain.

Fig. 6.3. Trajectories of mean (solid line) and standard deviation (dashed line) from the AR(1) process of Example 6.66. The left panel has $\theta = .5$, resulting in an ergodic Markov chain, and displays convergence of the mean and standard deviation. The right panel has $\theta = 1.0001$, resulting in a transient Markov chain and no convergence.
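
The contrast between the ergodic and transient regimes of the AR(1) chain can be reproduced with a few lines of simulation. The following is a rough sketch and not the code behind Figures 6.2 and 6.3: the chain lengths, the seed, the unit noise variance, and the helper name `ar1_running_means` are all assumptions, and the transient case uses $\theta = 2$ over a short run so that the values stay within floating-point range.

```python
import random

def ar1_running_means(theta, n, rng, x0=0.0, sigma=1.0):
    """Running means of X_t = theta * X_{t-1} + eps_t, eps_t ~ N(0, sigma^2)."""
    x, total, means = x0, 0.0, []
    for t in range(1, n + 1):
        x = theta * x + rng.gauss(0.0, sigma)
        total += x
        means.append(total / t)     # cumulative mean after t observations
    return means

rng = random.Random(42)
ergodic = ar1_running_means(0.5, 5000, rng)     # |theta| < 1
transient = ar1_running_means(2.0, 200, rng)    # |theta| > 1
```

In the ergodic run the cumulative mean stabilizes near 0, the mean of the stationary distribution, while in the transient run it grows without bound, which is the kind of trajectory diagnostic the example recommends over histograms of short-run means.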



6.7.2.3 Geometric Ergodicity and Regeneration

There is yet another approach to the Central Limit Theorem for Markov chains. It relies on geometric ergodicity, a Liapounov-type moment condition on the function $h$, and a regeneration argument. Robert et al. (2002), extending work of Chan and Geyer (1994) (see Problem 6.66), give specific conditions for Theorem 6.67 to apply, namely for the Liapounov condition to apply and a consistent estimate of $\gamma_h^2$ to be found.

Theorem 6.67. If $(X_n)$ is aperiodic, irreducible, positive Harris recurrent with invariant distribution $\pi$ and geometrically ergodic, and if, in addition,

\[ E^\pi[|h(X)|^{2+\delta}] < \infty \tag{6.32} \]

for some $\delta > 0$, then

\[ \sqrt{n} \left( S_n(h)/n - E^\pi[h(X)] \right) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \gamma_h^2), \tag{6.33} \]

where $\gamma_h^2$ is defined as in Theorem 6.65.

Robert et al. (2002) first discuss the difficulty in finding such estimates, as batch mean approximations are not consistent when the batch size is fixed. We can, however, use regeneration (Mykland et al. 1995) when available; that is, when a minorization condition as in Sections 6.3.2 and 6.5.2 holds: there exists a function $0 < s(x) < 1$ and a probability measure $Q$ such that, for all $x \in \mathcal{X}$ and all measurable sets $A$,

\[ P(x, A) \ge s(x)\, Q(A) . \tag{6.34} \]

Following an idea first developed in Robert (1995a) for MCMC algorithms, Robert et al. (2002) then construct legitimate asymptotic standard errors bypassing the estimation of $\gamma_h^2$.

The approach is to introduce the regeneration times $0 = \tau_0 < \tau_1 < \tau_2 < \cdots$ associated with the Markov chain $(X_t)$ and to write $S_n(h)$ in terms of the regeneration times, namely, if the chain is started as $X_0 \sim Q$ and stopped after the $T$-th regeneration,

\[ S_{\tau_T}(h) = \sum_{t=1}^{T} \sum_{j=\tau_{t-1}}^{\tau_t - 1} h(X_j) = \sum_{t=1}^{T} S_t, \]

where the $S_t$'s are the partial sums appearing in Theorem 6.63, which are iid. If we define the inter-regeneration lengths $N_t = \tau_t - \tau_{t-1}$, then

\[ \bar{h}_{\tau_T} = \frac{\sum_{t=1}^{T} S_t}{\sum_{t=1}^{T} N_t} = \frac{1}{\tau_T} \sum_{j=0}^{\tau_T - 1} h(X_j) \tag{6.35} \]

converges almost surely to $E^\pi[h(X)]$ when $T$ goes to infinity, by virtue of the Ergodic Theorem (Theorem 6.63), since $\tau_T$ converges almost surely to $\infty$.

By Theorem 6.37, $E^Q[N_1] = 1/E^\pi[s(X)]$ (which is assumed to be finite). It follows from the Strong Law of Large Numbers that $\bar{N}$ converges almost surely to $E^Q[N_1]$, which together with (6.35) implies that $\bar{S}_T$ converges almost surely to $E^Q[N_1]\, E^\pi[h(X)]$. This implies in particular that $E^Q[|S_1|] < \infty$ and $E^Q[S_1] = E^Q[N_1]\, E^\pi[h(X)]$. Hence, the random variables $S_t - N_t E^\pi[h(X)]$ are iid and centered. Thus

Theorem 6.68. If $E^Q[S_1^2]$ and $E^Q[N_1^2]$ are both finite, the Central Limit Theorem applies:

\[ \sqrt{T} \left( \bar{h}_{\tau_T} - E^\pi[h(X)] \right) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \sigma_h^2), \tag{6.36} \]

where

\[ \sigma_h^2 = \frac{E^Q\left[ (S_1 - N_1 E^\pi[h(X)])^2 \right]}{\{E^Q[N_1]\}^2} . \]

While it seems that (6.33) and (6.36) are very similar, the advantage in using this approach is that $\sigma_h^2$ can be estimated much more easily due to the underlying independent structure. For instance,

\[ \hat{\sigma}_h^2 = \frac{\sum_{t=1}^{T} (S_t - \bar{h}_{\tau_T} N_t)^2}{T \bar{N}^2} \]

is a consistent estimator of $\sigma_h^2$.

In addition, the conditions on $E^Q[S_1^2]$ and $E^Q[N_1^2]$ appearing in Theorem 6.68 are minimal in that they hold when the conditions of Theorem 6.67 hold (see Robert et al. 2002, for a proof).
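
The regenerative construction is easiest to see when the chain has a genuine atom, so that every visit to the atom is a regeneration. The sketch below is an illustration under assumptions not found in the text: a two-state chain with atom $\{0\}$, $h(x) = x$, illustrative transition probabilities, and a hypothetical helper `regenerative_estimate`. It forms the tour sums $S_t$ and lengths $N_t$, the ratio estimate $\sum_t S_t / \sum_t N_t$, and the plug-in variance estimate $\sum_t (S_t - \bar{h} N_t)^2 / (T \bar{N}^2)$ obtained by replacing the expectations in $\sigma_h^2$ with averages over tours.

```python
import math
import random

def regenerative_estimate(a, b, n_tours, rng):
    """Simulate tours of the two-state chain between visits to the atom {0},
    with P(0 -> 1) = a and P(1 -> 0) = b, and h(x) = x.
    Returns the ratio estimate of E[h] and the plug-in variance estimate."""
    S, N = [], []
    for _ in range(n_tours):
        s, length = 0, 1                 # the tour starts in state 0, h(0) = 0
        if rng.random() < a:             # jump to state 1 ...
            s, length = s + 1, length + 1
            while rng.random() >= b:     # ... and stay there w.p. 1 - b
                s, length = s + 1, length + 1
        S.append(s)
        N.append(length)
    T = n_tours
    h_bar = sum(S) / sum(N)              # ratio estimate of E^pi[h]
    N_bar = sum(N) / T                   # average tour length
    sigma2 = sum((s - h_bar * n) ** 2 for s, n in zip(S, N)) / (T * N_bar ** 2)
    return h_bar, sigma2

rng = random.Random(7)
n_tours = 50_000
h_bar, sigma2 = regenerative_estimate(0.3, 0.2, n_tours, rng)
half_width = 1.96 * math.sqrt(sigma2 / n_tours)   # approximate 95% half-width
```

For these assumed values the target is $\pi(1) = 0.6$ and a direct computation gives $\sigma_h^2 = 0.288$, so the output can be compared against known quantities; in a realistic MCMC setting the regeneration times would instead come from a minorization condition such as (6.34).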



6.8 Problems 

6.1 Examine whether a Markov chain $(X_t)$ may always be represented by the deterministic transform $X_{t+1} = \psi(X_t, \epsilon_t)$, where $(\epsilon_t)$ is a sequence of iid rv's. (Hint: Consider that $\epsilon_t$ can be of infinite dimension.)

6.2 Show that if $(X_n)$ is a time-homogeneous Markov chain, the transition kernel does not depend on $n$. In particular, if the Markov chain has a finite state-space, the transition matrix is constant.

6.3 Show that an ARMA$(p,q)$ model, defined by

\[ X_n = \sum_{i=1}^{p} \alpha_i X_{n-i} + \sum_{j=1}^{q} \beta_j \epsilon_{n-j} + \epsilon_n, \]

does not produce a Markov chain. (Hint: Examine the relation with an AR$(q)$ process through the decomposition

\[ Z_n = \sum_{i=1}^{p} \alpha_i Z_{n-i} + \epsilon_n, \qquad Y_n = \sum_{j=1}^{q} \beta_j Z_{n-j} + Z_n, \]

since $(Y_n)$ and $(X_n)$ are then identically distributed.)

6.4 Show that the resolvant kernel of Definition 6.8 is truly a kernel.

6.5 Show that the properties of the resolvant kernel are preserved if the geometric distribution $\mathcal{G}eo(\epsilon)$ is replaced by a Poisson distribution $\mathcal{P}(\lambda)$ with arbitrary parameter $\lambda$.

6.6 Derive the strong Markov property from the decomposition

\[ E_\mu[h(X_{\zeta+1}, X_{\zeta+2}, \ldots) \mid x_\zeta, x_{\zeta-1}, \ldots] = \sum_{n=1}^{\infty} E_\mu[h(X_{n+1}, X_{n+2}, \ldots) \mid x_n, x_{n-1}, \ldots, \zeta = n]\, P(\zeta = n \mid x_n, x_{n-1}, \ldots) \]

and from the weak Markov property.

6.7 Given the transition matrix

\[ P = \begin{pmatrix} 0.0 & 0.4 & 0.6 & 0.0 & 0.0 \\ 0.65 & 0.0 & 0.35 & 0.0 & 0.0 \\ 0.32 & 0.68 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.12 & 0.88 \\ 0.0 & 0.0 & 0.0 & 0.56 & 0.44 \end{pmatrix}, \]

examine whether the corresponding chain is irreducible and aperiodic.

6.8 Show that irreducibility in the sense of Definition 6.13 coincides with the more 
intuitive notion that two arbitrary states are connected when the Markov chain 
has a discrete support. 

6.9 Show that an aperiodic Markov chain on a finite state-space with transition matrix $P$ is irreducible if and only if there exists $N \in \mathbb{N}$ such that $P^N$ has no zero entries. (The matrix is then called regular.)

6.10 (Kemeny and Snell 1960) Show that for a regular matrix $P$:

(a) The sequence $(P^n)$ converges to a stochastic matrix $A$.

(b) Each row of $A$ is the same probability vector $\pi$.

(c) All components of $\pi$ are positive.

(d) For every probability vector $\mu$, $\mu P^n$ converges to $\pi$.

(e) $\pi$ satisfies $\pi = \pi P$.

(Note: See Kemeny and Snell 1960, p. 71 for a full proof.)

6.11 Show that for the measure $\psi$ given by (6.9), the chain $(X_n)$ is $\psi$-irreducible in the sense of Definition 6.13. Show that for two measures $\varphi_1$ and $\varphi_2$, such that $(X_n)$ is $\varphi_i$-irreducible, the corresponding $\psi_i$'s given by (6.9) are equivalent.

6.12 Let $Y_1, Y_2, \ldots$ be iid rv's concentrated on $\mathbb{N}_+$ and $Y_0$ be another rv also concentrated on $\mathbb{N}_+$. Define

\[ Z_n = \sum_{i=0}^{n} Y_i . \]

(a) Show that $(Z_n)$ is a Markov chain. Is it irreducible?

(b) Define the forward recurrence time as

\[ V_n^+ = \inf\{ Z_m - n ;\, Z_m > n \}. \]

Show that $(V_n^+)$ is also a Markov chain.

(c) If $V_n^+ = k > 1$, show that $V_{n+1}^+ = k - 1$. If $V_n^+ = 1$, show that a renewal occurs at $n + 1$. (Hint: Show that $V_{n+1}^+ \sim Y_i$ in the latter case.)

6.13 Detail the proof of Theorem 6.15. In particular, show that the fact that $K_\epsilon$ includes a Dirac mass does not invalidate the irreducibility. (Hint: Establish that

\[ E_x[\eta_A] = \sum_n P^n(x,A) \ge P_x(\tau_A < \infty), \]

\[ \lim_{\epsilon \to 1} K_\epsilon(x,A) \ge P_x(\tau_A < \infty), \]

\[ K_\epsilon(x,A) = (1 - \epsilon) \sum_{i=1}^{\infty} \epsilon^i P^i(x,A) > 0 \]

imply that there exists $n$ such that $K^n(x,A) > 0$. See Meyn and Tweedie 1993, p. 87.)

6.14 Show that the multiplicative random walk

\[ X_{t+1} = X_t \epsilon_t \]

is not irreducible when $\epsilon_t \sim \mathcal{E}xp(1)$ and $x_0 \in \mathbb{R}$. (Hint: Show that it produces two irreducible components.)

6.15 Show that in the setup of Example 6.17, the chain is not irreducible when $\epsilon_n$ is uniform on $[-1, 1]$ and $|\theta| > 1$.

6.16 In the spirit of Definition 6.25, we can define a uniformly transient set as a set $A$ for which there exists $M < \infty$ with

\[ E_x[\eta_A] < M, \quad \forall x \in A . \]

Show that transient sets are denumerable unions of uniformly transient sets.

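6.17 Show that the split chain defined on $\mathcal{X} \times \{0,1\}$ by the following transition kernel:

\[
\begin{aligned}
P(\check{X}_{n+1} \in A \times \{0\} \mid (x_n, 0))
&= \mathbb{I}_C(x_n) \left\{ \frac{P(x_n, A \cap C) - \epsilon \nu(A \cap C)}{1 - \epsilon} (1 - \epsilon) + \frac{P(x_n, A \cap C^c) - \epsilon \nu(A \cap C^c)}{1 - \epsilon} \right\} \\
&\quad + \mathbb{I}_{C^c}(x_n) \left\{ P(x_n, A \cap C)(1 - \epsilon) + P(x_n, A \cap C^c) \right\}, \\
P(\check{X}_{n+1} \in A \times \{1\} \mid (x_n, 0))
&= \mathbb{I}_C(x_n) \frac{P(x_n, A \cap C) - \epsilon \nu(A \cap C)}{1 - \epsilon}\, \epsilon + \mathbb{I}_{C^c}(x_n) P(x_n, A \cap C)\, \epsilon, \\
P(\check{X}_{n+1} \in A \times \{0\} \mid (x_n, 1)) &= \nu(A \cap C)(1 - \epsilon) + \nu(A \cap C^c), \\
P(\check{X}_{n+1} \in A \times \{1\} \mid (x_n, 1)) &= \nu(A \cap C)\, \epsilon,
\end{aligned}
\]

satisfies

\[
\begin{aligned}
P(\check{X}_{n+1} \in A \times \{1\} \mid \check{x}_n) &= \epsilon \nu(A \cap C), \\
P(\check{X}_{n+1} \in A \times \{0\} \mid \check{x}_n) &= \nu(A \cap C^c) + (1 - \epsilon) \nu(A \cap C)
\end{aligned}
\]

for every $\check{x}_n \in C \times \{1\}$. Deduce that $C \times \{1\}$ is an atom of the split chain $(\check{X}_n)$.

6.18 If $C$ is a small set and $B \subset C$, under which conditions on $B$ is $B$ a small set?

6.19 If $C$ is a small set and $D = \{x ;\, P^m(x, C) > \delta\}$, show that $D$ is a small set for $\delta$ small enough. (Hint: Use the Chapman-Kolmogorov equations.)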

6.20 Show that the period d given in Definition 6.23 is independent of the selected 
small set C and that this number characterizes the chain (Xn). 

6.21 Given the transition matrix 

$$P = \begin{pmatrix}
0.0  & 0.4  & 0.6  & 0.0  & 0.0  \\
0.6  & 0.0  & 0.35 & 0.0  & 0.05 \\
0.32 & 0.68 & 0.0  & 0.0  & 0.0  \\
0.0  & 0.0  & 0.12 & 0.0  & 0.88 \\
0.14 & 0.3  & 0.0  & 0.56 & 0.0
\end{pmatrix},$$



show that the corresponding chain is aperiodic, despite the null diagonal. 
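
For a finite irreducible chain, aperiodicity is equivalent to some power of $P$ having all entries strictly positive, which suggests a quick numerical check. The following is a minimal sketch in plain Python; the function names are illustrative and not from the text.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def first_positive_power(p, max_power=50):
    """Smallest n <= max_power such that every entry of P^n is > 0, else None."""
    power = p
    for n in range(1, max_power + 1):
        if all(entry > 0 for row in power for entry in row):
            return n
        power = mat_mul(power, p)
    return None


P = [
    [0.00, 0.40, 0.60, 0.00, 0.00],
    [0.60, 0.00, 0.35, 0.00, 0.05],
    [0.32, 0.68, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.12, 0.00, 0.88],
    [0.14, 0.30, 0.00, 0.56, 0.00],
]

n_star = first_positive_power(P)
print("smallest n with all entries of P^n positive:", n_star)
```

A finite value of `n_star` certifies both irreducibility and aperiodicity; it does not replace the argument asked for in the problem, which rests on exhibiting return paths of coprime lengths.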
The random walk (Examples 6.40 and 6.39) is a useful probability model and has
been given many colorful interpretations. (A popular one is the description of an 
inebriated individual whose progress along a street is composed of independent 
steps in random directions, and a question of interest is to describe where the 
individual will end up.) Here, we look at a simple version to illustrate a number 
of the Markov chain concepts. 

6.22 A random walk on the non-negative integers $I = \{0,1,2,\ldots\}$ can be constructed
in the following way. For $0 < p < 1$, let $Y_1, Y_2, \ldots$ be iid random
variables with $P(Y_i = 1) = p$ and $P(Y_i = -1) = 1 - p$, and $X_k = \sum_{i=1}^{k} Y_i$.
Then, $(X_n)$ is a Markov chain with transition probabilities

$$P(X_{i+1} = j+1 \mid X_i = j) = p, \qquad P(X_{i+1} = j-1 \mid X_i = j) = 1 - p,$$

but we make the exception that $P(X_{i+1} = 1 \mid X_i = 0) = p$ and
$P(X_{i+1} = 0 \mid X_i = 0) = 1 - p$.

(a) Show that (Xn) is a Markov chain. 

(b) Show that (Xn) is also irreducible. 

(c) Show that the invariant distribution of the chain is given by

$$a_k = \left(\frac{p}{1-p}\right)^k a_0, \qquad k = 1, 2, \ldots,$$

where $a_k$ is the probability that the chain is at $k$ and $a_0$ is arbitrary. For
what values of $p$ and $a_0$ is this a probability distribution?

(d) If $\sum_k a_k < \infty$, show that the invariant distribution is also the stationary
distribution of the chain; that is, the chain is ergodic.
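
The geometric form of the invariant measure for this reflected walk can be sanity-checked by plugging it into the balance equations $a = aP$. The sketch below assumes the candidate $a_k = (p/(1-p))^k a_0$ and an illustrative value $p = 0.3$; the helper names are hypothetical.

```python
def invariant_weights(p, a0, kmax):
    """Candidate invariant measure a_k = (p / (1 - p))**k * a0 for k = 0..kmax."""
    ratio = p / (1.0 - p)
    return [a0 * ratio ** k for k in range(kmax + 1)]


def balance_residuals(p, a):
    """Residuals of a = aP for states 0..len(a)-2 of the reflected random walk."""
    q = 1.0 - p
    residuals = [a[0] - (q * a[0] + q * a[1])]          # mass entering state 0
    for k in range(1, len(a) - 1):                       # mass entering state k >= 1
        residuals.append(a[k] - (p * a[k - 1] + q * a[k + 1]))
    return residuals


p = 0.3
a0 = (1.0 - 2.0 * p) / (1.0 - p)     # normalizing constant, only valid when p < 1/2
a = invariant_weights(p, a0, kmax=200)
max_residual = max(abs(r) for r in balance_residuals(p, a))
total_mass = sum(a)
print(max_residual, total_mass)
```

The residuals vanish for any $p$, but the total mass is finite only when $p/(1-p) < 1$, which is the point of parts (c) and (d).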

6.23 If $(X_t)$ is a random walk, $X_{t+1} = X_t + \epsilon_t$, such that $\epsilon_t$ has a moment generating
function $f$, defined in a neighborhood of 0, give the moment generating function
of $X_{t+1}$, $g_{t+1}$, in terms of $g_t$ and $f$, when $X_0 = 0$. Deduce that there is no
invariant distribution with a moment generating function in this case.

Although the property of aperiodicity is important, it is probably less impor- 
tant than properties such as recurrence and irreducibility. It is interesting that 
Feller (1971, Section XV.5) notes that the classification into periodic and aperiodic
states “represents a nuisance.” However, this is less true when the random
variables are continuous. 

6.24 (Continuation of Problem 6.22) 

(a) Using the definition of periodic given here, show that the random walk of 
Problem 6.22 is periodic with period 2. 

(b) Suppose that we modify the random walk of Problem 6.22 by letting $0 <
p + q < 1$ and redefining

$$Y_i = \begin{cases}
 1  & \text{with probability } p \\
 0  & \text{with probability } 1 - p - q \\
 -1 & \text{with probability } q.
\end{cases}$$



Show that this random walk is irreducible and aperiodic. Find the invariant 
distribution, and the conditions on p and q for which the Markov chain is 
positive recurrent. 







6.25 (Continuation of Problem 6.22) A Markov chain that is not positive recurrent
may be either null recurrent or transient. In either of these latter two cases, the
invariant distribution, if it exists, is not a probability distribution (it does not
have a finite integral), and the difference is one of expected return times. For
any integer $j$, the probability of returning to $j$ in $k$ steps is $p_{jj}^{(k)} = P(X_{i+k} =
j \mid X_i = j)$, and the expected return time is thus $m_{jj} = \sum_{k=1}^{\infty} k\, p_{jj}^{(k)}$.

(a) Show that since the Markov chain is irreducible, mjj = oo either for all j 
or for no j; that is, for any two states x and y, x is transient if and only if 
y is transient. 

(b) An irreducible Markov chain is transient if mjj = oo; otherwise it is re- 
current. Show that the random walk is positive recurrent if p < 1/2 and 
transient if p > 1/2. 

(c) Show that the random walk is null recurrent if $p = 1/2$. This is the interesting
case where each state will be visited infinitely often, but the expected
return time is infinite. 

6.26 Explain why the resolvent chain is necessarily strongly irreducible.

6.27 Consider a random walk on $\mathbb{R}_+$, defined as

$$X_{n+1} = (X_n + \epsilon)^+.$$

Show that the sets $(0, c)$ are small, provided $P(\epsilon < 0) > 0$.

6.28 Consider a random walk on $\mathbb{Z}$ with transition probabilities

$$P(Z_t = n+1 \mid Z_{t-1} = n) = 1 - P(Z_t = n-1 \mid Z_{t-1} = n) \propto n^{-\alpha}$$

and

$$P(Z_t = 1 \mid Z_{t-1} = 0) = 1 - P(Z_t = -1 \mid Z_{t-1} = 0) = 1/2.$$

Study the recurrence properties of the chain in terms of $\alpha$.

6.29 Establish (i) and (ii) of Theorem 6.28. 

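(a) Use

$$K^n(x,A) \ge K^r(x,\alpha)\, K^s(\alpha,\alpha)\, K^t(\alpha,A)$$

for $r + s + t = n$ and $r$ and $t$ such that

$$K^r(x,\alpha) > 0 \quad \text{and} \quad K^t(\alpha,A) > 0$$

to derive from the Chapman-Kolmogorov equations that $E_x[\eta_A] = \infty$ when
$E_\alpha[\eta_\alpha] = \infty$.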

(b) To show (ii): 

a) Establish that transience is equivalent to $P_\alpha(\tau_\alpha < \infty) < 1$.

b) Deduce that $E_x[\eta_\alpha] < \infty$ by using a generating function as in the proof
of Proposition 6.31.

c) Show that the covering of $\mathcal{X}$ is made of the

$$\bar{\alpha}_j = \Big\{ y;\ \sum_{n=1}^{j} K^n(y,\alpha) > j^{-1} \Big\}.$$

6.30 Referring to Definition 6.32, show that if $P(\eta_A = \infty) \ne 0$ then $E_x[\eta_A] = \infty$,
but that $P(\eta_A = \infty) = 0$ does not imply $E_x[\eta_A] < \infty$.

6.31 In connection with Example 6.42, show that the chain is null recurrent when
$f'(1) = 1$.

6.32 Referring to (6.21):

(a) Show that $E_\nu[\tau_C] < \infty$;

(b) show that $\sum_t P_\nu(\tau_C \ge t) = E_\nu[\tau_C]$.

6.33 Let $\Gamma = \{Z_n : n = 0, 1, \ldots\}$ be a discrete time homogeneous Markov chain
with state space $\mathcal{Z}$ and Markov transition kernel

$$M(z, \cdot) = \omega\nu(\cdot) + (1 - \omega) K(z, \cdot), \tag{6.37}$$

where $\omega \in (0, 1)$ and $\nu$ is a probability measure.

(a) Show that the measure

$$\phi(\cdot) = \sum_{i=1}^{\infty} \omega (1-\omega)^{i-1}\, \nu K^{i-1}(\cdot)$$

is an invariant probability measure for $\Gamma$.

(b) Deduce that $\Gamma$ is positive recurrent.

(c) Show that, when the transition kernel satisfies a minorization condition with
$C = \mathcal{X}$, (6.10) holds for all $x \in \mathcal{X}$ and the kernel is thus a mixture of the form (6.37).
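
The invariance asked for in part (a) can be illustrated on a small finite example. The sketch below assumes the series representation $\sum_{i \ge 1} \omega(1-\omega)^{i-1} \nu K^{i-1}$ of the candidate measure and uses a deliberately badly behaved $K$ (a deterministic 3-cycle, hence periodic); the kernel, $\nu$, and $\omega$ are illustrative choices, not taken from the text.

```python
def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def mixture_kernel(omega, nu, k):
    """M(z, .) = omega * nu(.) + (1 - omega) * K(z, .) on a finite state space."""
    return [[omega * nu[j] + (1.0 - omega) * k[i][j] for j in range(len(nu))]
            for i in range(len(k))]


def series_measure(omega, nu, k, terms=400):
    """Truncated series sum_{i>=1} omega (1 - omega)^(i-1) nu K^(i-1)."""
    phi = [0.0] * len(nu)
    current = list(nu)            # holds nu K^(i-1)
    weight = omega                # holds omega (1 - omega)^(i-1)
    for _ in range(terms):
        phi = [phi[j] + weight * current[j] for j in range(len(nu))]
        current = vec_mat(current, k)
        weight *= 1.0 - omega
    return phi


K = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]   # periodic 3-cycle
nu = [1.0, 0.0, 0.0]
omega = 0.25

M = mixture_kernel(omega, nu, K)
phi = series_measure(omega, nu, K)
phi_M = vec_mat(phi, M)
invariance_gap = max(abs(x - y) for x, y in zip(phi, phi_M))
print(phi, invariance_gap)
```

Even though $K$ alone is periodic, the mixture restarts from $\nu$ with probability $\omega$ at every step, which is what makes the series converge to a proper invariant probability.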

(Note: Even if the Markov chain associated with $K$ is badly behaved, e.g., transient,
$\Gamma$ is still positive recurrent. Breyer and Roberts (2000b) propose another
derivation of this result, through the functional identity

6.34 Establish the equality (6.14). 

6.35 Consider the simple Markov chain $(X_n)$, where each $X_i$ takes on the values
$-1$ and $1$ with $P(X_{i+1} = 1 \mid X_i = -1) = 1$, $P(X_{i+1} = -1 \mid X_i = 1) = 1$, and
$P(X_0 = 1) = 1/2$.

(a) Show that this is a stationary Markov chain.

(b) Show that cov$(X_0, X_k)$ does not go to zero.

(c) The Markov chain is not strictly positive. Verify this by exhibiting a set that 
has positive unconditional probability but zero conditional probability. 

{Note: The phenomenon seen here is similar to what Seidenfeld and Wasserman 
1993 call a dilation.) 

6.36 In the setup of Example 6.5, find the stationary distribution associated with 
the proposed transition when tt^ = tt^ and in general. 

6.37 Show the decomposition of the “first entrance and last exit” equation (6.23). 

6.38 If $(a_n)$ is a sequence of real numbers converging to $a$, and if $b_n = (a_1 + \cdots +
a_n)/n$, then show that

$$\lim_{n} b_n = a.$$

(Note: The sum $(1/n) \sum_{i=1}^{n} a_i$ is called a Cesàro average; see Billingsley 1995,
Section A30.)

6.39 Consider a sequence $(a_n)$ of positive numbers which is converging to $a^*$ and a
convergent series with running term $b_n$. Show that the convolution

$$\sum_{j=1}^{n-1} a_j b_{n-j} \longrightarrow a^* \sum_{j=1}^{\infty} b_j \quad \text{as } n \to \infty.$$

(Hint: Use the Dominated Convergence Theorem.)
6.40 (a) Verify (6.26), namely, that $\|\mu\|_{TV} = (1/2) \sup_{|h| \le 1} \left| \int h(x)\, \mu(dx) \right|$.

(b) Show that (6.26) is compatible with the definition of the total variation
norm. Establish the relation with the alternative definition

$$\|\mu\|_{TV} = \sup_A \mu(A) - \inf_A \mu(A).$$

6.41 Show that if $(X_n)$ and $(X'_n)$ are coupled at time $N_0$ and if $X_0 \sim \pi$, then
$X'_n \sim \pi$ for $n > N_0$ for any initial distribution of $X'_0$.

6.42 Using the notation of Section 6.6.1, set

$$u(n) = \sum_{j=0}^{\infty} p^{j\star}(n)$$

with $p^{j\star}$ the distribution of the sum $S_1 + \cdots + S_j$, $p^{0\star}$ the Dirac mass at 0, and

$$Z(n) = \mathbb{I}_{\exists j;\, S_j = n}.$$

(a) Show that $P_q(Z(n) = 1) = q \star u(n)$.

(b) Show that

$$|q \star u(n) - p \star u(n)| \le 2 P(T_{pq} > n).$$

(This bound is often called Orey's inequality, from Orey 1971. See Problem
7.10 for a slightly different formulation.)

(c) Show that if $m_p$ is finite,

$$e(n) = \frac{\sum_{i=n+1}^{\infty} p(i)}{m_p}$$

is the invariant distribution of the renewal process in the sense that
$P_e(Z(n) = 1) = 1/m_p$ for every $n$.

(d) Deduce from Lemma 6.49 that

$$\lim_{n} \left| q \star u(n) - \frac{1}{m_p} \right| = 0$$

when the mean renewal time is finite.
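
Parts (c) and (d) can be illustrated numerically. The sketch below assumes the renewal recursion $u(0) = 1$, $u(n) = \sum_{k=1}^{n} p(k)\,u(n-k)$ for the renewal sequence and takes the delay distribution $e(n) = \sum_{i>n} p(i)/m_p$ on $\{0, 1, \ldots\}$; the increment distribution $p$ is an arbitrary aperiodic example, and the function names are illustrative.

```python
def renewal_sequence(p, n_max):
    """u(n) = sum_j p^{j*}(n), computed by u(0) = 1, u(n) = sum_k p(k) u(n - k)."""
    u = [1.0]
    for n in range(1, n_max + 1):
        u.append(sum(p.get(k, 0.0) * u[n - k] for k in range(1, n + 1)))
    return u


def convolve_at(q, u, n):
    """(q * u)(n) = sum_{k=0}^{n} q(k) u(n - k)."""
    return sum(q.get(k, 0.0) * u[n - k] for k in range(n + 1))


p = {1: 0.2, 2: 0.5, 3: 0.3}                       # aperiodic since p(1) > 0
m_p = sum(k * w for k, w in p.items())             # mean renewal time
e = {n: sum(w for k, w in p.items() if k > n) / m_p for n in range(max(p))}

u = renewal_sequence(p, 60)
stationary_hits = [convolve_at(e, u, n) for n in range(61)]
print(u[60], 1.0 / m_p, stationary_hits[:3])
```

Started from $e$, the probability of a renewal at time $n$ is constant in $n$, while the undelayed sequence $u(n)$ only converges to $1/m_p$.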

6.43 Consider the so-called “forward recurrence time” process $(V_n^+)$, which is a
Markov chain on $\mathbb{N}_+$ with transition probabilities

$$P(1, j) = p(j), \quad j \ge 1, \qquad P(j, j-1) = 1, \quad j > 1,$$

where $p$ is an arbitrary probability distribution on $\mathbb{N}_+$. (See Problem 6.12.)

(a) Show that $(V_n^+)$ is recurrent.

(b) Show that

$$P(V_n^+ = j) = p(j + n - 1).$$

(c) Deduce that the invariant measure satisfies

$$\pi(j) = \sum_{n \ge j} p(n)$$

and show it is finite if and only if

$$m_p = \sum_{n} n\, p(n) < \infty.$$
6.44 (Continuation of Problem 6.43) Consider two independent forward recurrence
time processes $(V_n^+)$ and $(W_n^+)$ with the same generating probability distribution
$p$.

(a) Give the transition probabilities of the joint process $V_n^* = (V_n^+, W_n^+)$.

(b) Show that $(V_n^*)$ is irreducible when $p$ is aperiodic. (Hint: Consider $r$ and
$s$ such that g.c.d.$(r, s) = 1$ with $p(r) > 0$, $p(s) > 0$, and show that if
$nr - ms = 1$ and $i \ge j$, then the chain started at $(i, j)$ reaches $(1, 1)$
with positive probability.)

(c) Show that $\pi^* = \pi \times \pi$, with $\pi$ defined in Problem 6.43, is invariant and,
therefore, that $(V_n^*)$ is positive Harris recurrent when $m_p < \infty$.

6.45 (Continuation of Problem 6.44) Consider $V_n^*$ defined in Problem 6.44 associated
with $(S_n, S'_n)$ and define $\tau_{1,1} = \min\{n;\ V_n^* = (1, 1)\}$.

(a) Show that $T_{pq} = \tau_{1,1} + 1$.

(b) Use (c) in Problem 6.44 to show Lemma 6.49. 

6.46 (Kemeny and Snell 1960) Establish (directly) the Law of Large Numbers for a
finite irreducible state-space chain $(X_n)$ and for $h(x_n) = \mathbb{I}_j(x_n)$, if $j$ is a possible
state of the chain; that is,

$$\frac{1}{N} \sum_{n=1}^{N} \mathbb{I}_j(x_n) \longrightarrow \pi_j,$$

where $\pi = (\pi_1, \ldots, \pi_j, \ldots)$ is the stationary distribution.

6.47 (Kemeny and Snell 1960) Let $P$ be a regular transition matrix, that is, $PA =
AP$ (see Problem 6.9), with limiting (stationary) matrix $A$; that is, each row
of $A$ is equal to the stationary distribution.

(a) Show that the so-called fundamental matrix $Z = (I - (P - A))^{-1}$ exists.

(b) Show that $Z = I + \sum_{n=1}^{\infty} (P^n - A)$.

(c) Show that $Z$ satisfies $\pi Z = \pi$ and $PZ = ZP$, where $\pi$ denotes a row of $A$
(this is the stationary distribution).

6.48 (Continuation of Problem 6.47) Let Nj{n) be the number of times the chain 
is in state j in the first n instants. 

(a) Show that for every initial distribution $\mu$,

$$\lim_{n \to \infty} E_\mu[N_j(n)] - n\pi_j = \mu(Z - A).$$

(Note: This convergence shows the strong stability of a recurrent chain since
each term in the difference goes to infinity.)

(b) Show that for every pair of initial distributions, $(\mu, \nu)$,

$$\lim_{n \to \infty} E_\mu[N_j(n)] - E_\nu[N_j(n)] = (\mu - \nu) Z.$$

(c) Deduce that for every pair of states, $(u, v)$,

$$\lim_{n \to \infty} E_u[N_j(n)] - E_v[N_j(n)] = z_{uj} - z_{vj},$$

which is called the divergence $\mathrm{div}_j(u, v)$.

6.49 (Continuation of Problem 6.47) Let $f_j$ denote the number of steps before
entering state $j$.

(a) Show that for every state $i$, $E_i[f_j]$ is finite.

(b) Show that the matrix $M$ with entries $m_{ij} = E_i[f_j]$ can be written $M =
P(M - M_d) + E$, where $M_d$ is the diagonal matrix with same diagonal as
$M$ and $E$ is the matrix made of 1's.

(c) Deduce that $m_{ii} = 1/\pi_i$.

(d) Show that $\pi M$ is the vector of the $z_{ii}/\pi_i$'s.

(e) Show that for every pair of initial distributions, $(\mu, \nu)$,

$$E_\mu[f_i] - E_\nu[f_i] = (\mu - \nu)(I - Z) D,$$

where $D$ is the diagonal matrix diag$(1/\pi_i)$.

6.50 If $h$ is a function taking values on a finite state-space $\{1, \ldots, r\}$, with $h(i) = h_i$,
and if $(X_n)$ is an irreducible Markov chain, show that

$$\lim_{n \to \infty} \frac{1}{n} \mathrm{var}\left( \sum_{t=1}^{n} h(x_t) \right) = \sum_{i,j} h_i c_{ij} h_j,$$

where $c_{ij} = \pi_i z_{ij} + \pi_j z_{ji} - \pi_i \delta_{ij} - \pi_i \pi_j$ and $\delta_{ij}$ is Kronecker's 0-1 function.



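6.51 For the two-state transition matrix

$$P = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix},$$

show that

(a) the stationary distribution is $\pi = (\beta/(\alpha+\beta),\ \alpha/(\alpha+\beta))$;

(b) the mean first passage matrix is

$$M = \begin{pmatrix} (\alpha+\beta)/\beta & 1/\alpha \\ 1/\beta & (\alpha+\beta)/\alpha \end{pmatrix};$$

(c) and the limiting variance for the number of times in state $j$ is $\alpha\beta(2 - \alpha - \beta)/(\alpha + \beta)^3$,
for $j = 1, 2$.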
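
The three claims can be checked numerically for given values of $\alpha$ and $\beta$. The sketch below is only an illustration under stated assumptions: it uses the series form of the fundamental matrix from Problem 6.47(b), the identity $M = P(M - M_d) + E$ from Problem 6.49(b), and the diagonal term $c_{jj} = 2\pi_j z_{jj} - \pi_j - \pi_j^2$ of Problem 6.50; the numerical values are arbitrary.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def fundamental_matrix(p, pi, terms=500):
    """Truncated series Z = I + sum_{n>=1} (P^n - A), every row of A being pi."""
    n = len(p)
    z = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in p]
    for _ in range(terms):
        for i in range(n):
            for j in range(n):
                z[i][j] += power[i][j] - pi[j]
        power = mat_mul(power, p)
    return z


alpha, beta = 0.3, 0.1                      # illustrative values
P = [[1 - alpha, alpha], [beta, 1 - beta]]
pi = [beta / (alpha + beta), alpha / (alpha + beta)]

# (a) stationarity: pi P should equal pi
pi_P = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]

# (b) the claimed M should satisfy M = P (M - M_d) + E
M = [[(alpha + beta) / beta, 1 / alpha], [1 / beta, (alpha + beta) / alpha]]
M_off = [[0.0 if i == j else M[i][j] for j in range(2)] for i in range(2)]
rhs = [[x + 1.0 for x in row] for row in mat_mul(P, M_off)]

# (c) limiting variance of the occupation count of state j
Z = fundamental_matrix(P, pi)
c = [2 * pi[j] * Z[j][j] - pi[j] - pi[j] ** 2 for j in range(2)]
closed_form = alpha * beta * (2 - alpha - beta) / (alpha + beta) ** 3
print(pi_P, rhs, c, closed_form)
```

The series converges geometrically at rate $|1 - \alpha - \beta|$, so a few hundred terms are ample unless $\alpha + \beta$ is close to 0 or 2.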

6.52 Show that a finite state-space chain is always geometrically ergodic. 

6.53 (Kemeny and Snell 1960) Given a finite state-space Markov chain, with transition
matrix $P$, define a second transition matrix by

$$p_{ij}(n) = \frac{P_\mu(X_{n-1} = j)\, P(X_n = i \mid X_{n-1} = j)}{P_\mu(X_n = i)}.$$

(a) Show that $p_{ij}(n)$ does not depend on $n$ if the chain is stationary (i.e., if
$\mu = \pi$).

(b) Explain why, in this case, the chain with transition matrix $\tilde{P}$ made of the
probabilities

$$\tilde{p}_{ij} = \frac{\pi_j\, p_{ji}}{\pi_i}$$

is called the reverse Markov chain.

(c) Show that the limiting variance $C$ is the same for both chains.

6.54 (Continuation of Problem 6.53) A Markov chain is reversible if $\tilde{P} = P$. Show
that every two-state ergodic chain is reversible and that an ergodic chain with
symmetric transition matrix is reversible. Examine whether the matrix

$$P = \begin{pmatrix}
0   & 0   & 1   & 0   & 0   \\
0.5 & 0   & 0.5 & 0   & 0   \\
0   & 0.5 & 0   & 0.5 & 0   \\
0   & 0   & 0.5 & 0   & 0.5 \\
0   & 0   & 1   & 0   & 0
\end{pmatrix}$$

is reversible. (Hint: Show that $\pi = (0.1, 0.2, 0.4, 0.2, 0.1)$.)
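
Reversibility with respect to $\pi$ amounts to the detailed balance equations $\pi_i p_{ij} = \pi_j p_{ji}$, which are easy to test entry by entry. The sketch below assumes the matrix is read row by row as printed above; the helper names are illustrative.

```python
def is_invariant(pi, p, tol=1e-12):
    """Check pi P = pi componentwise."""
    n = len(pi)
    return all(abs(sum(pi[i] * p[i][j] for i in range(n)) - pi[j]) <= tol
               for j in range(n))


def detailed_balance_violations(pi, p, tol=1e-12):
    """Pairs (i, j), i < j, for which pi_i p_ij differs from pi_j p_ji."""
    n = len(pi)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(pi[i] * p[i][j] - pi[j] * p[j][i]) > tol]


P = [
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0, 0.0],
]
pi = [0.1, 0.2, 0.4, 0.2, 0.1]

print(is_invariant(pi, P))
print(detailed_balance_violations(pi, P))   # 0-based state pairs
```

An empty list of violations would mean the chain is reversible; a non-empty list identifies the transitions that break the symmetry.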
6.55 (Continuation of Problem 6.54) Show that an ergodic random walk on a finite
state-space is reversible.

6.56 (Kemeny and Snell 1960) A Markov chain $(X_n)$ is lumpable with respect to a
nontrivial partition of the state-space, $(A_1, \ldots, A_k)$, if, for every initial
distribution $\mu$, the process

$$Z_n = \sum_{i=1}^{k} i\, \mathbb{I}_{A_i}(X_n)$$

is a Markov chain with transition probabilities independent of $\mu$.

(a) Show that a necessary and sufficient condition for lumpability is that

$$p_{uA_j} = \sum_{v \in A_j} p_{uv}$$

is constant (in $u$) on $A_i$ for every $i$.

(b) Examine whether

$$P = \begin{pmatrix}
1   & 0   & 0   & 0   & 0   \\
0   & 1   & 0   & 0   & 0   \\
0.5 & 0   & 0   & 0.5 & 0   \\
0   & 0   & 0.5 & 0   & 0.5 \\
0   & 0.5 & 0   & 0.5 & 0
\end{pmatrix}$$

is lumpable for $A_1 = \{1, 2\}$, $A_2 = \{3, 4\}$, and $A_3 = \{5\}$.
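
The criterion of part (a) translates directly into a check on block sums of the rows of $P$. The following sketch assumes the matrix as printed above (second row read as $(0, 1, 0, 0, 0)$) and uses 0-based state labels; the function names are illustrative.

```python
def block_sums(p, partition):
    """For each state u, the probabilities p_{u, A_j} of moving into each block."""
    return [[sum(row[v] for v in block) for block in partition] for row in p]


def is_lumpable(p, partition, tol=1e-12):
    """True when p_{u, A_j} is constant in u over every block A_i."""
    sums = block_sums(p, partition)
    for block in partition:
        reference = sums[block[0]]
        for u in block[1:]:
            if any(abs(s - t) > tol for s, t in zip(sums[u], reference)):
                return False
    return True


P = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.5, 0.0, 0.5, 0.0],
]
partition = [[0, 1], [2, 3], [4]]     # A1 = {1, 2}, A2 = {3, 4}, A3 = {5}

print(block_sums(P, partition))
print(is_lumpable(P, partition))
```

Comparing the rows of `block_sums` within each block shows exactly which pair of states prevents (or allows) the lumping.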

6.57 Consider the random walk on $\mathbb{R}_+$, $X_{n+1} = (X_n + \epsilon_n)^+$, with $E[\epsilon_n] = \beta$.

(a) Establish Lemma 6.70. (Hint: Consider an alternative $V$ to $V^*$ and show
by recurrence that

$$V(x) \ge \int_C K(x,y) V(y)\, dy + \int_{C^c} K(x,y) V(y)\, dy \ge \cdots \ge V^*(x).)$$

(b) Establish Theorem 6.72 by assuming that there exists $x^*$ such that $P_{x^*}(\tau_C
< \infty) < 1$, choosing $M$ such that $M \ge V(x^*)/[1 - P_{x^*}(\tau_C < \infty)]$ and
establishing that $V(x^*) \ge M[1 - P_{x^*}(\tau_C < \infty)]$.

6.58 Show that 

(a) a time-homogeneous Markov chain $(X_n)$ is stationary if the initial distribution
is the invariant distribution;

(b) the invariant distribution of a stationary Markov chain is also the marginal
distribution of any $X_n$.

6.59 Referring to Section 6.7.1, let $X_n$ be a Markov chain and $h(\cdot)$ a function with
$E\,h(X_n) = 0$, $\mathrm{Var}\,h(X_n) = \sigma^2 > 0$, and $E[h(X_{n+1}) \mid x_n] = h(x_n)$, so $h(\cdot)$ is a
nonconstant harmonic function.

(a) Show that $E[h(X_{n+1}) \mid x_0] = h(x_0)$.

(b) Show that $\mathrm{Cov}(h(x_0), h(X_n)) = \sigma^2$.

(c) Use (6.52) to establish that $\mathrm{Var}\left( \frac{1}{n+1} \sum_{i=0}^{n} h(X_i) \right)$ does not converge
to 0 as $n \to \infty$, showing that the chain is not ergodic.

6.60 Show that if an irreducible Markov chain has a $\sigma$-finite invariant measure, this
measure is unique up to a multiplicative factor. (Hint: Use Theorem 6.63.)
6.61 (Kemeny and Snell 1960) Show that for an aperiodic irreducible Markov chain
with finite state-space and with transition matrix $P$, there always exists a stationary
probability distribution which satisfies

$$\pi = \pi P.$$

(a) Show that if $\beta < 0$, the random walk is recurrent. (Hint: Use the drift
function $V(x) = x$ as in Theorem 6.71.)

(b) Show that if $\beta = 0$ and var$(\epsilon_n) < \infty$, $(X_n)$ is recurrent. (Hint: Use $V(x) =
\log(1 + x)$ for $x > R$ and $V(x) = 0$, otherwise, for an adequate bound $R$.)

(c) Show that if $\beta > 0$, the random walk is transient.

6.62 Show that if there exist a finite potential function V and a small set C such 
that V is bounded on C and satisfies (6.40), the corresponding chain is Harris 
positive. 

6.63 Show that the random walk on $\mathbb{Z}$ is transient when $E[W_n] \ne 0$.

6.64 Show that the chains defined by the kernels (6.46) and (6.48) are either both 
recurrent or both transient. 

6.65 Referring to Example 6.66, show that the AR(1) chain is reversible. 

6.66 We saw in Section 6.6.2 that a stationary Markov chain is geometrically ergodic
if there is a non-negative real-valued function $M$ and a constant $r < 1$ such that
for any $A \in \mathcal{A}$,

$$|P(X_n \in A \mid X_0 \in B) - P(X_n \in A)| \le M(x)\, r^n.$$

Prove that the following Central Limit Theorem (due to Chan and Geyer 1994)
can be considered a corollary to Theorem 6.82 (see Note 6.9.4):

Corollary 6.69. Suppose that the stationary Markov chain $X_0, X_1, X_2, \ldots$ is
geometrically ergodic with $M^* = \int |M(x)| f(x)\, dx < \infty$ and satisfies the moment
conditions of Theorem 6.82. Then

$$\sigma^2 = \lim_{n \to \infty} n\, \mathrm{var}\, \bar{X}_n < \infty$$

and if $\sigma^2 > 0$, $\sqrt{n}\, \bar{X}_n$ tends in law to $\mathcal{N}(0, \sigma^2)$.

(Hint: Integrate (with respect to $f$) both sides of the definition of geometric
ergodicity to conclude that the chain has exponentially fast $\alpha$-mixing, and apply
Theorem 6.82.)

6.67 Suppose that $X_0, X_1, \ldots, X_n$ have a common mean $\xi$ and variance $\sigma^2$ and
that cov$(X_i, X_j) = \rho_{j-i}$. For estimating $\xi$, show that

(a) $\bar{X}$ may not be consistent if $\rho_{j-i} = \rho \ne 0$ for all $i \ne j$. (Hint: Note that
var$(\bar{X}) > 0$ for all sufficiently large $n$ requires $\rho \ge 0$ and determine the
distribution of $\bar{X}$ in the multivariate normal case.)

(b) $\bar{X}$ is consistent if $|\rho_{j-i}| \le M \gamma^{j-i}$ with $|\gamma| < 1$.

6.68 For the situation of Example 6.84: 

(a) Prove that the sequence $(X_n)$ is stationary provided $\sigma^2 = 1/(1 - \beta^2)$.

(b) Show that $E(X_k \mid x_0) = \beta^k x_0$. (Hint: Consider $E[(X_k - \beta X_{k-1}) \mid x_0]$.)

(c) Show that cov$(X_0, X_k) = \beta^k/(1 - \beta^2)$.

6.69 Under the conditions of Theorem 6.85, it follows that $E[E(X_k \mid X_0)]^2 \to 0$.
There are some other interesting properties of this sequence.

(a) Show that

$$\mathrm{var}[E(X_k \mid X_0)] = E[E(X_k \mid X_0)]^2,$$
$$\mathrm{var}[E(X_k \mid X_0)] \ge \mathrm{var}[E(X_{k+1} \mid X_0)].$$

(Hint: Write $f_{k+1}(y \mid x) = \int f_k(y \mid x') f(x' \mid x)\, dx'$ and use Fubini and Jensen.)

(b) Show that

$$E[\mathrm{var}(X_k \mid X_0)] \le E[\mathrm{var}(X_{k+1} \mid X_0)]$$

and that

$$\lim_{k \to \infty} E[\mathrm{var}(X_k \mid X_0)] = \sigma^2.$$

6.9 Notes 

6.9.1 Drift Conditions 

Besides atoms and small sets, Meyn and Tweedie (1993) rely on another tool to check
or establish various stability results, namely, drift criteria, which can be traced back
to Lyapunov. Given a function $V$ on $\mathcal{X}$, the drift of $V$ is defined by

$$\Delta V(x) = \int V(y)\, P(x, dy) - V(x).$$

(Functions $V$ appearing in this setting are often referred to as potentials; see Norris
1997.) This notion is also used in the following chapters to verify the convergence
properties of some MCMC algorithms (see, e.g., Theorem 7.15 or Mengersen and
Tweedie 1996).

The following lemma is instrumental in deriving drift conditions for the transience
or the recurrence of a chain $(X_n)$.

Lemma 6.70. If $C \in \mathcal{B}(\mathcal{X})$, the smallest positive function which satisfies the conditions

$$\Delta V(x) \le 0 \ \text{ if } x \notin C, \qquad V(x) \ge 1 \ \text{ if } x \in C \tag{6.38}$$

is given by

$$V^*(x) = P_x(\sigma_C < \infty),$$

where $\sigma_C$ denotes

$$\sigma_C = \inf\{n \ge 0;\ X_n \in C\}.$$

Note that, if $x \notin C$, $\sigma_C = \tau_C$, while $\sigma_C = 0$ on $C$. We then have the following
necessary and sufficient condition.

Theorem 6.71. The $\psi$-irreducible chain $(X_n)$ is transient if and only if there exist
a bounded positive function $V$ and a real number $r \ge 0$ such that for every $x$ for
which $V(x) > r$, we have

$$\Delta V(x) > 0. \tag{6.39}$$

Proof. If $C = \{x;\ V(x) \le r\}$ and $M$ is a bound on $V$, the conditions (6.38) are
satisfied by

$$\tilde{V}(x) = \begin{cases} (M - V(x))/(M - r) & \text{if } x \in C^c, \\ 1 & \text{if } x \in C. \end{cases}$$

Since $\tilde{V}(x) < 1$ for $x \in C^c$, $V^*(x) = P_x(\tau_C < \infty) < 1$ on $C^c$, and this implies the
transience of $C^c$, therefore the transience of $(X_n)$. The converse can be deduced from
a (partial) converse to Proposition 6.31 (see Meyn and Tweedie 1993, p. 190). □

Condition (6.39) describes an average increase of $V(x_n)$ once a certain level has
been attained, and therefore does not allow a sure return to 0 of $V$. The condition is
thus incompatible with the stability associated with recurrence. On the other hand,
if there exists a potential function $V$ “attracted” to 0, the chain is recurrent.

Theorem 6.72. Consider $(X_n)$ a $\psi$-irreducible Markov chain. If there exist a small
set $C$ and a function $V$ such that

$$C_V(n) = \{x;\ V(x) \le n\}$$

is a small set for every $n$, the chain is recurrent if

$$\Delta V(x) \le 0 \ \text{ on } C^c.$$

The fact that $C_V(n)$ is small means that the function $V$ is not bounded outside
small sets. The attraction of the chain toward smaller values of $V$ on the sets where
$V$ is large is thus a guarantee of stability for the chain. The proof of the above result
is, again, quite involved, based on the fact that $P_x(\tau_C < \infty) = 1$ (see Meyn and
Tweedie 1993, p. 191).

Example 6.73. (Continuation of Example 6.39) If the distribution of $W_n$ has a
finite support and zero expectation, $(X_n)$ is recurrent. When considering $V(x) = |x|$
and $r$ such that $\gamma_x = 0$ for $|x| > r$, we get

$$\Delta V(x) = \sum_{n=-r}^{r} \gamma_n (|x + n| - |x|),$$

which is equal to

$$\sum_{n=-r}^{r} \gamma_n n \ \text{ if } x \ge r \qquad \text{and} \qquad -\sum_{n=-r}^{r} \gamma_n n \ \text{ if } x \le -r.$$

Therefore, $\Delta V(x) = 0$ for $x \notin \{-r+1, \ldots, r-1\}$, which is a small set. Conversely,
if $W_n$ has a nonzero mean, $X_n$ is transient. ||
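
The vanishing of the drift outside $\{-r+1, \ldots, r-1\}$ can be verified exactly on a small case. The sketch below assumes an illustrative zero-mean increment distribution on $\{-2, \ldots, 2\}$ (so $r = 2$) and uses exact rational arithmetic; the distribution and names are not from the text.

```python
from fractions import Fraction


def drift_abs(x, gamma):
    """Delta V(x) = sum_n gamma_n (|x + n| - |x|) for the potential V(x) = |x|."""
    return sum(w * (abs(x + n) - abs(x)) for n, w in gamma.items())


# Illustrative increment distribution with finite support and zero mean.
gamma = {-2: Fraction(2, 10), -1: Fraction(1, 10), 0: Fraction(3, 10),
         1: Fraction(3, 10), 2: Fraction(1, 10)}
r = 2
mean_increment = sum(n * w for n, w in gamma.items())

drifts = {x: drift_abs(x, gamma) for x in range(-10, 11)}
outside = [drifts[x] for x in drifts if abs(x) >= r]   # x outside {-r+1, ..., r-1}
inside = [drifts[x] for x in drifts if abs(x) < r]
print(mean_increment, outside, inside)
```

Outside the central set the absolute values open with a fixed sign, so the drift reduces to plus or minus the mean increment; inside it the drift is strictly positive here, which is harmless since the recurrence criterion only constrains $V$ off a small set.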

For Harris recurrent chains, positivity can also be related to a drift condition 
and to a “regularity” condition on visits to small sets. 

Theorem 6.74. If $(X_n)$ is Harris recurrent with invariant measure $\pi$, there is equivalence
between

(a) $\pi$ is finite;

(b) there exist a small set $C$ and a positive number $M_C$ such that

$$\sup_{x \in C} E_x[\tau_C] \le M_C;$$

(c) there exist a small set $C$, a function $V$ taking values in $\mathbb{R} \cup \{\infty\}$, and a positive
real number $b$ such that

$$\Delta V(x) \le -1 + b\, \mathbb{I}_C(x). \tag{6.40}$$

See Meyn and Tweedie (1993, Chapter 11) for a proof and discussion of these
equivalences. (If there exists $V$ finite and bounded on $C$ which satisfies (6.40), the
chain $(X_n)$ is necessarily Harris positive.)

The notion of a Kendall atom introduced in Section 6.6.2 can also be extended
to non-atomic chains by defining Kendall sets as sets $A$ such that

$$\sup_{x \in A} E_x\left[ \sum_{k=0}^{\tau_A - 1} \kappa^k \right] < \infty \tag{6.41}$$

with $\kappa > 1$. The existence of a Kendall set guarantees a geometric drift condition.
If $C$ is a Kendall set and if

$$V(x) = E_x[\kappa^{\sigma_C}],$$

the function $V$ satisfies

$$\Delta V(x) \le -\beta V(x) + b\, \mathbb{I}_C(x) \tag{6.42}$$

with $\beta > 0$ and $0 < b < \infty$. This condition also guarantees geometric convergence
for $(X_n)$ in the following way.



Theorem 6.75. For a $\psi$-irreducible and aperiodic chain $(X_n)$ and a small Kendall
set $C$, there exist $R < \infty$ and $r > 1$, $\kappa > 1$ such that

$$\sum_{n=1}^{\infty} r^n \| K^n(x, \cdot) - \pi(\cdot) \|_{TV} \le R\, E_x\left[ \sum_{k=0}^{\tau_C - 1} \kappa^k \right] < \infty \tag{6.43}$$

for almost every $x \in \mathcal{X}$.

The three conditions (6.41), (6.42) and (6.43) are, in fact, equivalent for $\psi$-irreducible
aperiodic chains if $A$ is a small set in (6.41) and if $V$ is bounded from
below by 1 in (6.42) (see Meyn and Tweedie 1993, pp. 354-355). The drift condition
(6.42) is certainly the simplest to check in practice, even though the potential
function $V$ must be derived.



Example 6.76. (Continuation of Example 6.20) The condition $|\theta| < 1$ is
necessary for the chain $X_n = \theta x_{n-1} + \epsilon_n$ to be recurrent. Assume $\epsilon_n$ has a strictly
positive density on $\mathbb{R}$. Define $V(x) = |x| + 1$. Then

$$E_x[V(X_1)] = 1 + E[|\theta x + \epsilon_1|]$$
$$\le 1 + |\theta|\, |x| + E[|\epsilon_1|]$$
$$= |\theta|\, V(x) + E[|\epsilon_1|] + 1 - |\theta|$$

and

$$\Delta V(x) \le -(1 - |\theta|)\, V(x) + E[|\epsilon_1|] + 1 - |\theta|$$
$$= -(1 - |\theta|)\gamma V(x) + E[|\epsilon_1|] + 1 - |\theta| - (1 - \gamma)(1 - |\theta|)\, V(x)$$
$$\le -\beta V(x) + b\, \mathbb{I}_C(x)$$

for $\beta = (1 - |\theta|)\gamma$, $b = E[|\epsilon_1|] + 1 - |\theta|$, and $C$ equal to

$$C = \left\{ x;\ V(x) < \frac{E[|\epsilon_1|] + 1 - |\theta|}{(1 - \gamma)(1 - |\theta|)} \right\},$$

if $|\theta| < 1$ and $E[|\epsilon_1|] < +\infty$. These conditions thus imply geometric ergodicity for
AR(1) models. ||
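
The geometric drift inequality of this example can be checked numerically in a special case. The sketch below assumes standard normal noise, for which $E|\mu + \epsilon_1|$ has the closed folded-normal form used in `expected_abs_normal`, and illustrative values $\theta = 0.8$ and $\gamma = 0.5$; none of these choices come from the text.

```python
import math


def expected_abs_normal(mu):
    """E|mu + eps| for eps ~ N(0, 1), the mean of a folded normal."""
    return (math.sqrt(2.0 / math.pi) * math.exp(-0.5 * mu * mu)
            + mu * math.erf(mu / math.sqrt(2.0)))


def drift(x, theta):
    """Delta V(x) = E_x[V(X_1)] - V(x) with V(x) = |x| + 1, X_1 = theta x + eps."""
    return expected_abs_normal(theta * x) - abs(x)


theta, gamma = 0.8, 0.5                      # |theta| < 1 and 0 < gamma < 1
mean_abs_eps = math.sqrt(2.0 / math.pi)      # E|eps_1| for N(0, 1) noise
beta = (1.0 - abs(theta)) * gamma
b = mean_abs_eps + 1.0 - abs(theta)
c_level = b / ((1.0 - gamma) * (1.0 - abs(theta)))   # C = {x : V(x) < c_level}

# Largest value of Delta V(x) - (-beta V(x) + b I_C(x)) over a grid of x values.
worst_gap = -float("inf")
for k in range(-5000, 5001):
    x = k / 100.0
    v = abs(x) + 1.0
    bound = -beta * v + (b if v < c_level else 0.0)
    worst_gap = max(worst_gap, drift(x, theta) - bound)
print(beta, b, c_level, worst_gap)
```

A non-positive `worst_gap` means the inequality $\Delta V(x) \le -\beta V(x) + b\,\mathbb{I}_C(x)$ holds at every grid point; a smaller $\gamma$ shrinks $C$ at the price of a slower rate $\beta$.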



Meyn and Tweedie (1994) propose, in addition, explicit evaluations of convergence
rates $r$ as well as explicit bounds $R$ in connection with drift conditions (6.42),
but the geometric convergence is evaluated under a norm induced by the very function
$V$ satisfying (6.42), which makes the result somewhat artificial.

There also is an equivalent form of uniform ergodicity involving drift, namely
that $(X_n)$ is aperiodic and there exist a small set $C$, a bounded potential function
$V \ge 1$, and constants $0 < b < \infty$ and $\beta > 0$ such that

$$\Delta V(x) \le -\beta V(x) + b\, \mathbb{I}_C(x), \qquad x \in \mathcal{X}. \tag{6.44}$$

In a practical case (see, e.g., Example 12.6), this alternative to the conditions of
Theorem 6.59 is often the most natural approach.

As mentioned after Theorem 6.64, there exist alternative versions of the Central
Limit Theorem based on drift conditions. Assume that there exist a function $f \ge 1$,
a finite potential function $V$, and a small set $C$ such that

$$\Delta V(x) \le -f(x) + b\, \mathbb{I}_C(x), \qquad x \in \mathcal{X}, \tag{6.45}$$

and that $E^\pi[V^2] < \infty$. This is exactly condition (6.44) above, with $f = V$, which
implies that (6.45) holds for a uniformly ergodic chain.

Theorem 6.77. If the ergodic chain $(X_n)$ with invariant distribution $\pi$ satisfies
conditions (6.45), for every function $g$ such that $|g| \le f$, then

$$\gamma_g^2 = \lim_{n \to \infty} n\, E_\pi[S_n^2(\bar{g})]
 = E_\pi[\bar{g}^2(x_0)] + 2 \sum_{k=1}^{\infty} E_\pi[\bar{g}(x_0)\, \bar{g}(x_k)],$$

where $\bar{g} = g - E^\pi[g]$, is non-negative and finite. If $\gamma_g > 0$, the Central Limit
Theorem holds for $\sqrt{n}\, S_n(\bar{g})$. If $\gamma_g = 0$, $\sqrt{n}\, S_n(\bar{g})$ almost surely goes to 0.

This theorem is definitely relevant for convergence assessment of Markov chain
Monte Carlo algorithms since, when $\gamma_g > 0$, it is possible to assess the convergence
of the ergodic averages $S_n(g)$ to the quantity of interest $E^\pi[g]$. Theorem 6.77 also
suggests how to implement this monitoring through renewal theory, as discussed in
detail in Chapter 12.
6.9.2 Eaton’s Admissibility Condition

Eaton (1992) exhibits interesting connections, similar to Brown (1971), between the
admissibility of an estimator and the recurrence of an associated Markov chain. The
problem considered by Eaton (1992) is to determine whether, for a bounded function
$g(\theta)$, a generalized Bayes estimator associated with a prior measure $\pi$ is admissible
under quadratic loss. Assuming that the posterior distribution $\pi(\theta \mid x)$ is well defined,
he introduces the transition kernel

$$K(\theta, \eta) = \int_{\mathcal{X}} \pi(\theta \mid x)\, f(x \mid \eta)\, dx, \tag{6.46}$$

which is associated with a Markov chain $(\theta^{(n)})$ generated as follows: The transition
from $\theta^{(n)}$ to $\theta^{(n+1)}$ is done by generating first $x \sim f(x \mid \theta^{(n)})$ and then $\theta^{(n+1)} \sim
\pi(\theta \mid x)$. (Most interestingly, this is also a kernel used by Markov Chain Monte Carlo
methods, as shown in Chapter 9.) Note that the prior measure $\pi$ is an invariant
measure for the chain $(\theta^{(n)})$. For every measurable set $C$ such that $\pi(C) < +\infty$,
consider

$$V(C) = \{ h \in \mathcal{L}^2(\pi);\ h(\theta) \ge 0 \text{ and } h(\theta) \ge 1 \text{ when } \theta \in C \}$$

and

$$\Delta(h) = \int\!\!\int \{h(\theta) - h(\eta)\}^2\, K(\theta, \eta)\, \pi(\eta)\, d\theta\, d\eta.$$

The following result then characterizes admissibility for all bounded functions in
terms of $\Delta$ and $V(C)$ (that is, independently of the estimated functions $g$).

Theorem 6.78. If for every $C$ such that $\pi(C) < +\infty$,

$$\inf_{h \in V(C)} \Delta(h) = 0, \tag{6.47}$$

then the Bayes estimator $E^\pi[g(\theta) \mid x]$ is admissible under quadratic loss for every
bounded function $g$.

This result is obviously quite general but only mildly helpful in the sense that
the practical verification of (6.47) for every set $C$ can be overwhelming. Note also
that (6.47) always holds when $\pi$ is a proper prior distribution since $h \equiv 1$ belongs
to $\mathcal{L}^2(\pi)$ and $\Delta(1) = 0$ in this case. The extension then considers approximations of
1 by functions in $V(C)$. Eaton (1992) exhibits a connection with the Markov chain
$(\theta^{(n)})$, which gives a condition equivalent to Theorem 6.78. First, for a given set $C$,
a stopping rule $\tau_C$ is defined as the first integer $n > 0$ such that $(\theta^{(n)})$ belongs to $C$
(and $+\infty$ otherwise), as in Definition 6.10.

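Theorem 6.79. For every set $C$ such that $\pi(C) < +\infty$,

$$\inf_{h \in V(C)} \Delta(h) = \int_C \left\{ 1 - P(\tau_C < +\infty \mid \theta^{(0)} = \eta) \right\} \pi(\eta)\, d\eta.$$

Therefore, the generalized Bayes estimators of bounded functions of $\theta$ are admissible
if and only if the associated Markov chain $(\theta^{(n)})$ is recurrent.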
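
The two-step transition behind the kernel (6.46) is easy to simulate. The sketch below is a toy illustration under assumptions that are not taken from the text: a normal model $x \mid \theta \sim \mathcal{N}(\theta, 1)$ with a flat (Lebesgue) prior on $\theta$, so that $\pi(\theta \mid x)$ is $\mathcal{N}(x, 1)$ and the chain is a Gaussian random walk whose increments have variance 2.

```python
import random


def eaton_chain(theta0, n_steps, rng):
    """Simulate theta^(n) -> theta^(n+1) by drawing x, then theta from the posterior."""
    thetas = [theta0]
    for _ in range(n_steps):
        x = rng.gauss(thetas[-1], 1.0)        # x ~ f(x | theta) = N(theta, 1)
        thetas.append(rng.gauss(x, 1.0))      # theta' ~ pi(theta | x) = N(x, 1), flat prior
    return thetas


rng = random.Random(12345)                     # fixed seed for reproducibility
chain = eaton_chain(0.0, 20000, rng)

increments = [b - a for a, b in zip(chain, chain[1:])]
mean_inc = sum(increments) / len(increments)
var_inc = sum((d - mean_inc) ** 2 for d in increments) / (len(increments) - 1)
sign_changes = sum(1 for a, b in zip(chain, chain[1:]) if a * b < 0)
print(var_inc, sign_changes)
```

In one dimension this random walk is recurrent, in line with the admissibility of the corresponding generalized Bayes estimators; a simulation can only suggest recurrence (through repeated returns near the origin), not prove it.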
Again, we refer to Eaton (1992) for extensions, examples, and comments on this
result. Note, however, that the verification of the recurrence of the Markov chain
$(\theta^{(n)})$ is much easier than the determination of the lower bound of $\Delta(h)$. Hobert
and Robert (1999) consider the potential of using the dual chain based on the kernel

$$K'(x, y) = \int_\Theta f(y \mid \theta)\, \pi(\theta \mid x)\, d\theta \tag{6.48}$$

(see Problem 6.64) and derive admissibility results for various distributions of interest.

6.9.3 Alternative Convergence Conditions 



Athreya et al. (1996) present a careful development of the basic limit theorems for
Markov chains, with conditions stated that are somewhat more accessible in Markov
chain Monte Carlo uses, rather than formal probabilistic properties.

Consider a time-homogeneous Markov chain $(X_n)$ where $f$ is the invariant density
and $f_k(\cdot \mid \cdot)$ is the conditional density of $X_k$ given $X_0$. So, in particular, $f_1(\cdot \mid \cdot)$ is the
transition kernel. For a basic limit theorem such as Theorem 6.51, there are two
conditions that are required on the transition kernel, both of which have to do with
the ability of the Markov chain to visit all sets $A$. Assume that the transition kernel
satisfies¹: There exists a set $A$ such that

(i) $\sum_{k=1}^{\infty} \int_A f_k(x \mid x_0)\, d\mu(x) > 0$ for all $x_0$,

(ii) $\inf_{x, y \in A} f_1(y \mid x) > 0$.

A set $A$ satisfying (i) is called accessible, which means that from anywhere in the
state space there is positive probability of eventually entering $A$. Condition (ii) is
essentially a minorization condition. The larger the set $A$, the easier it is to verify
(i) and the harder it is to verify (ii). These two conditions imply that the chain is
irreducible and aperiodic. It is possible to weaken (ii) to a condition that involves
$f_k$ for some $k > 1$; see Athreya et al. (1996).

The limit theorem of Athreya et al. (1996) can be stated as follows. 

Theorem 6.80. Suppose that the Markov chain $(X_n)$ has invariant density $f(\cdot)$ and
transition kernel $f_1(\cdot \mid \cdot)$ that satisfies Conditions (i) and (ii). Then

$$\lim_{k \to \infty} \sup_A \left| \int_A f_k(x \mid x_0)\, dx - \int_A f(x)\, dx \right| = 0 \tag{6.49}$$

for $f$ almost all $x_0$.



6.9.4 Mixing Conditions and Central Limit Theorems 

In Section 6.7.2, we established a Central Limit Theorem using regeneration, which
allowed us to use a typical independence argument. Other conditions, known as mixing
conditions, can also result in a Central Limit Theorem. These mixing conditions
guarantee that the dependence in the Markov chain decreases fast enough, and variables
that are far enough apart are close to being independent. Unfortunately, these
conditions are usually quite difficult to verify. Consider the property of $\alpha$-mixing
(Billingsley 1995, Section 27).

¹ The conditions stated here are weaker than those given in the first edition; we
thank Hani Doss for showing us this improvement.
Definition 6.81. A sequence $X_0, X_1, X_2, \ldots$ is $\alpha$-mixing if

$$\alpha_n = \sup_{A, B} |P(X_n \in A, X_0 \in B) - P(X_n \in A) P(X_0 \in B)| \tag{6.50}$$

goes to 0 when $n$ goes to infinity.
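
For a stationary chain on a small finite state space, the coefficient in (6.50) can be computed by brute force over all pairs of subsets. The sketch below assumes an illustrative doubly stochastic 3-state matrix, whose stationary distribution is uniform; the matrix and function names are not from the text.

```python
from itertools import combinations


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(p, n):
    """P^n by repeated multiplication (n >= 0)."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result


def alpha_coefficient(p, pi, n):
    """sup over A, B of |P(X_n in A, X_0 in B) - P(X_n in A) P(X_0 in B)|, X_0 ~ pi."""
    pn = mat_power(p, n)
    states = range(len(p))
    sets = [s for size in range(len(p) + 1) for s in combinations(states, size)]
    best = 0.0
    for a in sets:
        for b in sets:
            joint = sum(pi[i] * pn[i][j] for i in b for j in a)
            product = sum(pi[j] for j in a) * sum(pi[i] for i in b)
            best = max(best, abs(joint - product))
    return best


P = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]   # doubly stochastic
pi = [1.0 / 3.0] * 3
alphas = {n: alpha_coefficient(P, pi, n) for n in (1, 5, 10, 20)}
print(alphas)
```

The enumeration grows as $4^m$ for $m$ states, so this is only practical for very small chains, but it makes the geometric decay of $\alpha_n$ visible.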

So, we see that an $\alpha$-mixing sequence will tend to “look independent” if the
variables are far enough apart. As a result, we would expect that Theorem 6.63
is a consequence of $\alpha$-mixing. This is, in fact, the case, as every positive recurrent
aperiodic Markov chain is $\alpha$-mixing (Rosenblatt 1971, Section VII.3), and if the
Markov chain is stationary and $\alpha$-mixing, the covariances go to zero (Billingsley
1995, Section 27).

However, for a Central Limit Theorem, we need even more. Not only must the
Markov chain be $\alpha$-mixing, but we need the coefficient $\alpha_n$ to go to 0 fast enough;
that is, we need the dependence to go away fast enough. One version of a Markov
chain Central Limit Theorem is the following (Billingsley 1995, Section 27).

Theorem 6.82. Suppose that the Markov chain $(X_n)$ is stationary and $\alpha$-mixing
with $\alpha_n = O(n^{-5})$ and that $E[X_n] = 0$ and $E[X_n^{12}] < \infty$. Then,

$$\sigma^2 = \lim_{n \to \infty} n\, \mathrm{var}\, \bar{X}_n < \infty$$

and if $\sigma > 0$, $\sqrt{n}\, \bar{X}_n$ tends in law to $\mathcal{N}(0, \sigma^2)$.

This theorem is not very useful because the condition on the mixing coefficient
is very hard to verify. (Billingsley 1995 notes that the conditions are stronger than
needed, but are imposed to avoid technical difficulties in the proof.) Others have
worked hard to get the condition in a more accessible form and have exploited the
relationship between mixing and ergodicity. Informally, if (6.50) goes to 0, dividing
through by $P(X_0 \in B)$ we expect that

$$|P(X_n \in A \mid X_0 \in B) - P(X_n \in A)| \to 0, \tag{6.51}$$

which looks quite similar to the assumption that the Markov chain is ergodic. (This
corresponds, in fact, to a stronger type of mixing called $\beta$-mixing. See Bradley 1986.)
We actually need something stronger (see Problem 7.6), like uniform ergodicity,
where there are constants $M$ and $r < 1$ such that $|P(X_n \in A \mid X_0 \in B) - P(X_n \in A)| \le
M r^n$. Tierney (1994) presents the following Central Limit Theorem.

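Theorem 6.83. Let $(X_n)$ be a stationary uniformly ergodic Markov chain. For any
function $h(\cdot)$ satisfying $\mathrm{var}\, h(X_i) = \sigma^2 < \infty$, there exists a real number $\tau_h$ such that

$\sqrt{n}\, \dfrac{\sum_{i=1}^n h(X_i)/n - E[h(X)]}{\tau_h} \xrightarrow{\mathcal{L}} \mathcal{N}(0, 1)$.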

Other versions of the Central Limit Theorem exist. See, for example, Robert
(1994), who surveys other mixing conditions and their connections with the Central
Limit Theorem.

6.9.5 Covariance in Markov Chains

An application of Chebychev’s inequality shows that the convergence of an average
of random variables from a Markov chain can be connected to the behavior of the
covariances, with a sufficient condition for convergence in probability being that the
covariances go to zero.

We assume that the Markov chain is Harris positive and aperiodic, and is stationary.
We also assume that the random variables of the chain have finite variance.
Thus, let $(X_n)$ be a stationary ergodic Markov chain with mean 0 and finite variance.
The variance of the average of the $X_i$'s, $i = 0, \ldots, n$, is

(6.52) $\mathrm{var}(\bar{X}) = \dfrac{1}{n+1}\, \mathrm{var}(X_0) + \dfrac{2}{n+1} \sum_{k=1}^{n} \dfrac{n-k+1}{n+1}\, \mathrm{cov}(X_0, X_k)$,

so the covariance term in (6.52) will go to zero if $\sum_{k=1}^{n} \mathrm{cov}(X_0, X_k)/n$ goes to zero,
and a sufficient condition for this is that $\mathrm{cov}(X_0, X_k)$ converges to 0 (Problem 6.38).
To see when $\mathrm{cov}(X_0, X_k)$ converges to 0, write

$|\mathrm{cov}(X_0, X_k)| = |E[X_0 X_k]|$

$= |E[X_0\, E(X_k|X_0)]|$

(6.53) $\le [E(X_0^2)]^{1/2}\, \{E[E(X_k|X_0)]^2\}^{1/2}$,

where we used the Cauchy-Schwarz inequality. Since $E(X_0^2) = \sigma^2$, $\mathrm{cov}(X_0, X_k)$ will
go to zero if $E[E(X_k|X_0)]^2$ goes to 0.

Example 6.84. (Continuation of Example 6.6) Consider the AR(1) model

(6.54) $X_k = \theta X_{k-1} + \epsilon_k$, $k = 0, \ldots, n$,

when the $\epsilon_k$'s are iid $\mathcal{N}(0, 1)$, $\theta$ is an unknown parameter satisfying $|\theta| < 1$, and
$X_0 \sim \mathcal{N}(0, \sigma^2)$. The $X_k$'s all have marginal normal distributions with mean zero.
The variance of $X_k$ satisfies $\mathrm{var}(X_k) = \theta^2\, \mathrm{var}(X_{k-1}) + 1$ and $\mathrm{var}(X_k) = \sigma^2$ for all $k$,
provided $\sigma^2 = 1/(1 - \theta^2)$. This is the stationary case in which it can be shown that

(6.55) $E(X_k|X_0) = \theta^k X_0$

and, hence, $E[E(X_k|X_0)]^2 = \theta^{2k} \sigma^2$, which goes to zero as long as $|\theta| < 1$. Thus,
$\mathrm{var}(\bar{X})$ converges to 0. (See Problem 6.68.) ||
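As a numerical companion to this example, the sketch below evaluates the variance of the average of $X_0, \ldots, X_n$ for the stationary AR(1) chain in two ways: through the decomposition into a variance term plus weighted covariances $\mathrm{cov}(X_0, X_k) = \theta^k \sigma^2$, and through a brute-force sum over the full covariance matrix. The function names are ours and the parameter values are arbitrary illustrations.

```python
def ar1_mean_variance(theta, n):
    """Variance of the average of X_0..X_n for the stationary AR(1) chain.

    Uses var(X_0)/(n+1) + 2/(n+1) * sum_k (n-k+1)/(n+1) * cov(X_0, X_k)
    with cov(X_0, X_k) = theta**k * sigma2 and sigma2 = 1/(1 - theta**2).
    """
    sigma2 = 1.0 / (1.0 - theta ** 2)
    cov_term = sum((n - k + 1) / (n + 1) * theta ** k * sigma2
                   for k in range(1, n + 1))
    return sigma2 / (n + 1) + 2.0 / (n + 1) * cov_term


def ar1_mean_variance_bruteforce(theta, n):
    """Same quantity from the full matrix cov(X_i, X_j) = theta**|i-j| * sigma2."""
    sigma2 = 1.0 / (1.0 - theta ** 2)
    total = sum(theta ** abs(i - j)
                for i in range(n + 1) for j in range(n + 1))
    return sigma2 * total / (n + 1) ** 2


# The variance of the average shrinks as n grows, even for strong dependence.
variances = [ar1_mean_variance(0.9, n) for n in (10, 100, 1000, 10000)]
```

The two computations agree, and $(n+1)\,\mathrm{var}(\bar{X})$ approaches $1/(1-\theta)^2$, which shows how strongly positive dependence inflates the variance relative to the iid value $\sigma^2$.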

Returning to (6.53), let $M$ be a positive constant and write

$E[E(X_k|X_0)]^2 = E[E(X_k I_{X_k > M}|X_0) + E(X_k I_{X_k \le M}|X_0)]^2$

(6.56) $\le 2E[E(X_k I_{X_k > M}|X_0)]^2 + 2E[E(X_k I_{X_k \le M}|X_0)]^2$.

Examining the two terms on the right side of (6.56), the first term can be made
arbitrarily small using the fact that $X_k$ has finite variance, while the second term
converges to zero as a consequence of Theorem 6.51. We formalize this in the
following theorem.

Theorem 6.85. If the Markov chain $(X_n)$ is positive and aperiodic, with $\mathrm{var}(X_n) < \infty$,
then $\mathrm{cov}(X_0, X_k)$ converges to 0.



The Metropolis-Hastings Algorithm



“What’s changed, except what needed changing?” And there was something 
in that, Cadfael reflected. What was changed was the replacement of falsity 
by truth... 

— Ellis Peters, The Confession of Brother Haluin

This chapter is the first of a series on simulation methods based on Markov 
chains. However, it is a somewhat strange introduction because it contains a 
description of the most general algorithm of all. The next chapter (Chapter 
8) concentrates on the more specific slice sampler, which then introduces the 
Gibbs sampler (Chapters 9 and 10), which, in turn, is a special case of the 
Metropolis-Hastings algorithm. (However, the Gibbs sampler is different in 
both fundamental methodology and historical motivation.) 

The motivation for this reckless dive into a completely new and general 
simulation algorithm is that there exists no simple case of the Metropolis- 
Hastings algorithm that would “gently” explain the fundamental principles of 
the method; a global presentation does, on the other hand, expose us to the 
almost infinite possibilities offered by the algorithm. 

Unfortunately, the drawback of this ordering is that some parts of the chap- 
ter will be completely understood only after reading later chapters. But realize 
that this is the pivotal chapter of the book, one that addresses the methods 
that radically changed our perception of simulation and opened countless new 
avenues of research and applications. It is thus worth reading this chapter more 
than once! 



7.1 The MCMC Principle 

It was shown in Chapter 3 that it is not necessary to directly simulate a
sample from the distribution $f$ to approximate the integral

$\mathfrak{I} = \int h(x) f(x)\, dx$,

since other approaches like importance sampling can be used. While Chapter
14 will clarify the complex connections existing between importance sampling
and Markov chain Monte Carlo methods, this chapter first develops a
somewhat different strategy and shows that it is possible to obtain a sample
$X_1, \ldots, X_n$ approximately distributed from $f$ without directly simulating
from $f$. The basic principle underlying the methods described in this chapter
and the following ones is to use an ergodic Markov chain with stationary
distribution $f$.

While we will discuss below some rather general schemes to produce valid
transition kernels associated with arbitrary stationary distributions, the working
principle of MCMC algorithms is thus as follows: For an arbitrary starting
value $x^{(0)}$, a chain $(X^{(t)})$ is generated using a transition kernel with stationary
distribution $f$, which ensures the convergence in distribution of $(X^{(t)})$ to a
random variable from $f$. (Given that the chain is ergodic, the starting value
$x^{(0)}$ is, in principle, unimportant.)

Definition 7.1. A Markov chain Monte Carlo (MCMC) method for the simulation
of a distribution $f$ is any method producing an ergodic Markov chain
$(X^{(t)})$ whose stationary distribution is $f$.

This simple idea of using a Markov chain with limiting distribution $f$ may
sound impractical. In comparison with the techniques of Chapter 3, here we
rely on more complex asymptotic convergence properties than a simple Law
of Large Numbers, as we generate dependencies within the sample that slow
down convergence of the approximation of $\mathfrak{I}$. Thus, the number of iterations
required to obtain a good approximation seems a priori much larger
than with a standard Monte Carlo method. The appeal to Markov chains is
nonetheless justified from at least two points of view. First, in Chapter 5, we
have already seen that some stochastic optimization algorithms (for example,
the Robbins-Monro procedure in Note 5.5.3) naturally produce Markov
chain structures. It is a general fact that the use of Markov chains allows
for a greater scope than the methods presented in Chapters 2 and 3. Second,
regular Monte Carlo and MCMC algorithms both satisfy the $O(1/\sqrt{n})$
convergence requirement for the approximation of $\mathfrak{I}$. There are thus many
instances where a specific MCMC algorithm dominates, variance-wise, the
corresponding Monte Carlo proposal. For instance, while importance sampling
is virtually a universal method, its efficiency relies on adequate choices
of the importance function and this choice gets harder and harder as the
dimension increases, a practical realization of the curse of dimensionality. At a
first level, some generic algorithms, like the Metropolis-Hastings algorithms,
also use simulations from almost any arbitrary density $g$ to actually generate
from an equally arbitrary given density $f$. At a second level, however, since
these algorithms allow for the dependence of $g$ on the previous simulation, the
choice of $g$ does not require a particularly elaborate construction a priori but
can take advantage of the local characteristics of the stationary distribution.
Moreover, even when an Accept-Reject algorithm is available, it is sometimes
more efficient to use the pair $(f, g)$ through a Markov chain, as detailed in
Section 7.4. Even if this point is not obvious at this stage, it must be stressed
that the (re)discovery of Markov chain Monte Carlo methods by statisticians
in the 1990s has produced considerable progress in simulation-based inference
and, in particular, in Bayesian inference, since it has allowed the analysis of a
multitude of models that were too complex to be satisfactorily processed by
previous schemes.



7.2 Monte Carlo Methods Based on Markov Chains 

Despite its formal aspect, Definition 7.1 can be turned into a working principle:
the use of a chain $(X^{(t)})$ produced by a Markov chain Monte Carlo
algorithm with stationary distribution $f$ is fundamentally identical to the use
of an iid sample from $f$ in the sense that the ergodic theorem (Theorem 6.63)
guarantees the (almost sure) convergence of the empirical average

(7.1) $\mathfrak{I}_T = \dfrac{1}{T} \sum_{t=1}^{T} h(X^{(t)})$

to the quantity $E_f[h(X)]$. A sequence $(X^{(t)})$ produced by a Markov chain
Monte Carlo algorithm can thus be employed just as an iid sample. If there is
no particular requirement of independence but if, rather, the purpose of the
simulation study is to examine the properties of the distribution $f$, there is no
need for the generation of $n$ independent chains $(X_i^{(t)})$ ($i = 1, \ldots, n$), where
only some “terminal” values $X_i^{(T_0)}$ are kept: the choice of the value $T_0$ may
induce a bias and, besides, this approach results in the considerable waste of
$n(T_0 - 1)$ simulations out of $nT_0$. In other words, a single realization (or path)
of a Markov chain is enough to ensure a proper approximation of $\mathfrak{I}$ through
estimates like (7.1) for the functions $h$ of interest (and sometimes even of
the density $f$, as detailed in Chapter 10). Obviously, handling this sequence
is somewhat more arduous than in the iid case because of the dependence
structure, but some approaches to the convergence assessment of (7.1) are
given in Section 7.6 and in Chapter 12. Chapter 13 will also discuss strategies
to efficiently produce iid samples with MCMC algorithms.

Given the principle stated in Definition 7.1, one can propose an infinite
number of practical implementations such as those, for instance, used in statistical
physics. The Metropolis-Hastings algorithms described in this chapter have
the advantage of imposing minimal requirements on the target density $f$ and
allowing for a wide choice of possible implementations. In contrast, the Gibbs
sampler described in Chapters 8-10 is more restrictive, in the sense that it
requires some knowledge of the target density to derive some conditional densities,
but it can also be more effective than a generic Metropolis-Hastings
algorithm.

7.3 The Metropolis-Hastings Algorithm

Before illustrating the universality of Metropolis-Hastings algorithms and 
demonstrating their straightforward implementation, we first address the (im- 
portant) issue of theoretical validity. Since the results presented below are 
valid for all types of Metropolis-Hastings algorithms, we do not include ex- 
amples in this section, but rather wait for Sections 7.4 and 7.5, which present 
a collection of specific algorithms. 

7.3.1 Definition 

The Metropolis-Hastings algorithm starts with the objective (target) density
$f$. A conditional density $q(y|x)$, defined with respect to the dominating measure
for the model, is then chosen. The Metropolis-Hastings algorithm can
be implemented in practice when $q(\cdot|x)$ is easy to simulate from and is either
explicitly available (up to a multiplicative constant independent of $x$) or
symmetric; that is, such that $q(x|y) = q(y|x)$. The target density $f$ must be
available to some extent: a general requirement is that the ratio

$f(y)/q(y|x)$

is known up to a constant independent of $x$.

The Metropolis-Hastings algorithm associated with the objective (target)
density $f$ and the conditional density $q$ produces a Markov chain $(X^{(t)})$
through the following transition.

Algorithm A.24 -Metropolis-Hastings-

Given $x^{(t)}$,

1. Generate $Y_t \sim q(y|x^{(t)})$.

2. Take

$X^{(t+1)} = Y_t$ with probability $\rho(x^{(t)}, Y_t)$,
$X^{(t+1)} = x^{(t)}$ with probability $1 - \rho(x^{(t)}, Y_t)$,

where

$\rho(x, y) = \min\left\{ \dfrac{f(y)}{f(x)}\, \dfrac{q(x|y)}{q(y|x)},\ 1 \right\}$. [A.24]

The distribution $q$ is called the instrumental (or proposal) distribution and
the probability $\rho(x, y)$ the Metropolis-Hastings acceptance probability.
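The transition of [A.24] can be sketched in a few lines of code. The version below is a minimal illustration under our own assumptions, not a reference implementation: it works with log densities for numerical stability, and the demonstration target (an exponential density with mean 1), the multiplicative log-normal proposal, the step size, and every function name are hypothetical choices. Because this proposal is not symmetric, the ratio $q(x|y)/q(y|x) = y/x$ genuinely enters the acceptance probability.

```python
import math
import random


def metropolis_hastings(log_f, propose, log_q, x0, n_steps, rng):
    """Run n_steps Metropolis-Hastings transitions starting from x0.

    log_f(x)    -- log target density, up to an additive constant
    propose(x)  -- draws y from q(.|x)
    log_q(y, x) -- log q(y|x), up to a constant that does not depend on x
    """
    chain = [x0]
    x = x0
    for _ in range(n_steps):
        y = propose(x)
        # log of f(y) q(x|y) / (f(x) q(y|x))
        log_ratio = (log_f(y) + log_q(x, y)) - (log_f(x) + log_q(y, x))
        rho = math.exp(min(0.0, log_ratio))  # acceptance probability
        if rng.random() < rho:
            x = y  # accept the proposed value
        chain.append(x)  # on rejection the current value is repeated
    return chain


rng = random.Random(0)
STEP = 1.0  # hypothetical proposal scale on the log axis


def log_target(x):
    """Unnormalized log density of an exponential target with mean 1."""
    return -x if x > 0 else -math.inf


def propose(x):
    """Multiplicative random walk: log Y ~ N(log x, STEP**2)."""
    return x * math.exp(STEP * rng.gauss(0.0, 1.0))


def log_proposal(y, x):
    """Log-normal proposal density q(y|x), up to an additive constant."""
    return -math.log(y) - (math.log(y) - math.log(x)) ** 2 / (2.0 * STEP ** 2)


chain = metropolis_hastings(log_target, propose, log_proposal, 1.0, 50_000, rng)
estimate = sum(chain) / len(chain)  # ergodic average approximating E_f[X] = 1
```

Only differences of `log_f` and of `log_q` are used, so normalizing constants of $f$ never need to be known; dropping the `log_q` terms here would target the wrong distribution, since the proposal is asymmetric.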

This algorithm always accepts values $y_t$ such that the ratio $f(y_t)/q(y_t|x^{(t)})$
is increased, compared with the previous value $f(x^{(t)})/q(x^{(t)}|y_t)$. It is only
in the symmetric case that the acceptance is driven by the objective ratio
$f(y_t)/f(x^{(t)})$. An important feature of the algorithm [A.24] is that it may
accept values $y_t$ such that the ratio is decreased, similar to stochastic optimization
methods (see Section 5.4). Like the Accept-Reject method, the
Metropolis-Hastings algorithm depends only on the ratios

$f(y_t)/f(x^{(t)})$ and $q(x^{(t)}|y_t)/q(y_t|x^{(t)})$

and is, therefore, independent of normalizing constants, assuming, again, that
$q(\cdot|x)$ is known up to a constant that is independent of $x$.^

Obviously, the probability $\rho(x^{(t)}, y_t)$ is defined only when $f(x^{(t)}) > 0$.
However, if the chain starts with a value $x^{(0)}$ such that $f(x^{(0)}) > 0$, it follows
that $f(x^{(t)}) > 0$ for every $t \in \mathbb{N}$ since the values of $y_t$ such that $f(y_t) = 0$ lead
to $\rho(x^{(t)}, y_t) = 0$ and are, therefore, rejected by the algorithm. We will make
the convention that the ratio $\rho(x, y)$ is equal to 0 when both $f(x)$ and $f(y)$
are null, in order to avoid theoretical difficulties.

There are similarities between [A.24] and the Accept-Reject methods of
Section 2.3, and it is possible to use the algorithm [A.24] as an alternative
to an Accept-Reject algorithm for a given pair $(f, g)$. These approaches are
compared in Section 7.4. However, a sample produced by [A.24] differs from
an iid sample. For one thing, such a sample may involve repeated occurrences
of the same value, since rejection of $Y_t$ leads to repetition of $X^{(t)}$ at time
$t + 1$ (an impossible occurrence in absolutely continuous iid settings). Thus,
in calculating a mean such as (7.1), the $Y_t$'s generated by the algorithm [A.24]
can be associated with weights of the form $m_t/T$ ($m_t = 0, 1, \ldots$), where $m_t$
counts the number of times the subsequent values have been rejected. (This
makes the comparison with importance sampling somewhat more relevant, as
discussed in Section 7.6 and Chapter 14.)
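The weighted representation can be illustrated by collapsing a chain path into distinct values with multiplicities. In the sketch below, which uses names of our own choosing and a made-up path, each accepted value is stored once together with the number of iterations it occupies (one plus the number of subsequent rejections), and a never-accepted proposal simply does not appear, which corresponds to a zero weight.

```python
from itertools import groupby


def compress_chain(chain):
    """Collapse consecutive repeats into (value, multiplicity) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(chain)]


def weighted_mean(pairs, h):
    """Average of h over the path, using multiplicities as weights."""
    total = sum(m for _, m in pairs)
    return sum(m * h(v) for v, m in pairs) / total


# Hypothetical path of length 7: 0.3 was kept for 2 iterations, 0.7 for 3.
path = [0.3, 0.3, 1.2, 0.7, 0.7, 0.7, 2.1]
pairs = compress_chain(path)
plain = sum(path) / len(path)
weighted = weighted_mean(pairs, lambda v: v)
```

Both averages coincide, so the compressed form loses nothing for estimates of the type (7.1), and it makes the analogy with a weighted (importance sampling) estimator explicit.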

While [A.24] is a generic algorithm, defined for all $f$'s and $q$'s, it is nonetheless
necessary to impose minimal regularity conditions on both $f$ and the
conditional distribution $q$ for $f$ to be the limiting distribution of the chain
$(X^{(t)})$ produced by [A.24]. For instance, it is easier if $\mathcal{E}$, the support of $f$, is
connected: an unconnected support $\mathcal{E}$ can invalidate the Metropolis-Hastings
algorithm. For such supports, it is necessary to proceed on one connected
component at a time and show that the different connected components of $\mathcal{E}$
are linked by the kernel of [A.24]. If the support $\mathcal{E}$ is truncated by $q$, that
is, if there exists $A \subset \mathcal{E}$ such that

$\int_A f(x)\, dx > 0$ and $\int_A q(y|x)\, dy = 0$, $\forall x \in \mathcal{E}$,

the algorithm [A.24] does not have $f$ as a limiting distribution since, for
$x^{(0)} \notin A$, the chain $(X^{(t)})$ never visits $A$. Thus, a minimal necessary condition
is that

$\bigcup_{x \in \mathrm{supp}\, f} \mathrm{supp}\, q(\cdot|x) \supset \mathrm{supp}\, f$.

^ If we insist on this independence from $x$, it is because forgetting a term in $q(\cdot|x)$
that depends on $x$ does jeopardize the validity of the whole algorithm.

To see that $f$ is the stationary distribution of the Metropolis chain, we
first examine the Metropolis kernel more closely and find that it satisfies the
detailed balance property (6.22). (See Problem 7.3 for details of the proof.)

Theorem 7.2. Let $(X^{(t)})$ be the chain produced by [A.24]. For every conditional
distribution $q$ whose support includes $\mathcal{E}$,

(a) the kernel of the chain satisfies the detailed balance condition with $f$;

(b) $f$ is a stationary distribution of the chain.

Proof. The transition kernel associated with [A.24] is

(7.2) $K(x, y) = \rho(x, y)\, q(y|x) + (1 - r(x))\, \delta_x(y)$,

where $r(x) = \int \rho(x, y)\, q(y|x)\, dy$ and $\delta_x$ denotes the Dirac mass in $x$. It is
straightforward to verify that

(7.3) $\rho(x, y)\, q(y|x)\, f(x) = \rho(y, x)\, q(x|y)\, f(y)$
and $(1 - r(x))\, \delta_x(y)\, f(x) = (1 - r(y))\, \delta_y(x)\, f(y)$,

which together establish detailed balance for the Metropolis-Hastings chain.
Part (b) now follows from Theorem 6.46. □

The stationarity of $f$ is therefore established for almost any conditional
distribution $q$, a fact which indicates the universality of Metropolis-Hastings
algorithms.
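On a finite state space the Metropolis-Hastings kernel is a matrix, and both conclusions of Theorem 7.2 can be checked numerically. The sketch below is a discrete analogue under our own assumptions: the unnormalized target `f`, the proposal matrix `q`, and the function names are arbitrary illustrations.

```python
def mh_kernel(f, q):
    """Metropolis-Hastings transition matrix for target f and proposal rows q[x]."""
    n = len(f)
    kernel = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x and q[x][y] > 0.0:
                rho = min(1.0, f[y] * q[y][x] / (f[x] * q[x][y]))
                kernel[x][y] = q[x][y] * rho
        # Remaining mass: proposing x itself, plus all rejected proposals.
        kernel[x][x] = 1.0 - sum(kernel[x])
    return kernel


def detailed_balance_gap(f, kernel):
    """Largest violation of f(x) K(x, y) = f(y) K(y, x)."""
    n = len(f)
    return max(abs(f[x] * kernel[x][y] - f[y] * kernel[y][x])
               for x in range(n) for y in range(n))


def stationarity_gap(f, kernel):
    """Largest violation of sum_x f(x) K(x, y) = f(y)."""
    n = len(f)
    return max(abs(sum(f[x] * kernel[x][y] for x in range(n)) - f[y])
               for y in range(n))


f = [1.0, 2.0, 3.0, 4.0]  # hypothetical target, known only up to a constant
q = [[0.10, 0.40, 0.30, 0.20],
     [0.25, 0.25, 0.25, 0.25],
     [0.50, 0.10, 0.20, 0.20],
     [0.30, 0.30, 0.30, 0.10]]
K = mh_kernel(f, q)
```

Both gaps vanish up to rounding error whatever positive proposal matrix is supplied, which is the finite-state counterpart of the universality noted above. Note that `f` is deliberately left unnormalized.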

7.3.2 Convergence Properties 

To show that the Markov chain of [A.24] indeed converges to the stationary
distribution and that (7.1) is a convergent approximation to $\mathfrak{I}$, we need to
apply further the theory developed in Chapter 6.

Since the Metropolis-Hastings Markov chain has, by construction, an invariant
probability distribution $f$, if it is also an aperiodic Harris chain (see
Definition 6.32), then the ergodic theorem (Theorem 6.63) does apply to establish
a result like the convergence of (7.1) to $\mathfrak{I}$.

A sufficient condition for the Metropolis-Hastings Markov chain to be
aperiodic is that the algorithm [A.24] allows events such as $\{X^{(t+1)} = X^{(t)}\}$;
that is, that the probability of such events is not zero, and thus

(7.4) $P\left[ f(X^{(t)})\, q(Y_t|X^{(t)}) \le f(Y_t)\, q(X^{(t)}|Y_t) \right] < 1$.

Interestingly, this condition implies that $q$ is not the transition kernel of a
reversible Markov chain with stationary distribution $f$.^ (Note that $q$ is not
the transition kernel of the Metropolis-Hastings chain, given by (7.2), which
is reversible.)

The fact that [A.24] works only when (7.4) is satisfied is not overly troublesome,
since it merely states that it is useless to further perturb a Markov
chain with transition kernel $q$ if the latter already converges to the distribution
$f$. It is then sufficient to directly study the chain associated with $q$.

The property of irreducibility of the Metropolis-Hastings chain $(X^{(t)})$ follows
from sufficient conditions such as positivity of the conditional density $q$;
that is,

(7.5) $q(y|x) > 0$ for every $(x, y) \in \mathcal{E} \times \mathcal{E}$,

since it then follows that every set of $\mathcal{E}$ with positive Lebesgue measure can
be reached in a single step. As the density $f$ is the invariant measure for the
chain, the chain is positive (see Definition 6.35) and Proposition 6.36 implies
that the chain is recurrent. We can also establish the following stronger result
for the Metropolis-Hastings chain.

Lemma 7.3. If the Metropolis-Hastings chain $(X^{(t)})$ is $f$-irreducible, it is
Harris recurrent.



Proof. This result can be established by using the fact that a characteristic of
Harris recurrence is that the only bounded harmonic functions are constant
(see Proposition 6.61).

If $h$ is a harmonic function, it satisfies

$h(x_0) = E[h(X^{(1)})|x_0] = E[h(X^{(t)})|x_0]$.

Because the Metropolis-Hastings chain is positive recurrent and aperiodic, we
can use Theorem 6.80, as in the discussion surrounding (6.27), and conclude
that $h$ is $f$-almost everywhere constant and equal to $E_f[h(X)]$. To show that
$h$ is everywhere constant, write

$E[h(X^{(1)})|x_0] = \int \rho(x_0, x_1)\, q(x_1|x_0)\, h(x_1)\, dx_1 + (1 - r(x_0))\, h(x_0)$,

and substitute $E_f[h(X)]$ for $h(x_1)$ in the integral above. It follows that

$E_f[h(X)]\, r(x_0) + (1 - r(x_0))\, h(x_0) = h(x_0)$;

that is, $(h(x_0) - E_f[h(X)])\, r(x_0) = 0$ for every $x_0 \in \mathcal{E}$. Since $r(x_0) > 0$ for
every $x_0 \in \mathcal{E}$, by virtue of the $f$-irreducibility, $h$ is necessarily constant and
the chain is Harris recurrent. □

^ For instance, (7.4) is not satisfied by the successive steps of the Gibbs sampler
(see Theorem 10.13).

We therefore have the following convergence result for Metropolis-Hastings
Markov chains.

Theorem 7.4. Suppose that the Metropolis-Hastings Markov chain $(X^{(t)})$ is
$f$-irreducible.

(i) If $h \in L^1(f)$, then

$\lim_{T \to \infty} \dfrac{1}{T} \sum_{t=1}^{T} h(X^{(t)}) = \int h(x) f(x)\, dx$ a.e. $f$.

(ii) If, in addition, $(X^{(t)})$ is aperiodic, then

$\lim_{n \to \infty} \left\| \int K^n(x, \cdot)\, \mu(dx) - f \right\|_{TV} = 0$

for every initial distribution $\mu$, where $K^n(x, \cdot)$ denotes the kernel for $n$
transitions, as in (6.5).

Proof. If $(X^{(t)})$ is $f$-irreducible, it is Harris recurrent by Lemma 7.3, and
part (i) then follows from Theorem 6.63 (the Ergodic Theorem). Part (ii) is
an immediate consequence of Theorem 6.51. □



As the $f$-irreducibility of the Metropolis-Hastings chain follows from the
above-mentioned positivity property of the conditional density $q$, we have the
following immediate corollary, whose proof is left as an exercise.

Corollary 7.5. The conclusions of Theorem 7.4 hold if the Metropolis-Hastings
Markov chain $(X^{(t)})$ has conditional density $q(x|y)$ that satisfies (7.4)
and (7.5).

Although condition (7.5) may seem restrictive, it is often satisfied in prac- 
tice. (Note that, typically, conditions for irreducibility involve the transition 
kernel of the chain, as in Theorem 6.15 or Note 6.9.3.) 

We close this section with a result due to Roberts and Tweedie (1996) 
(see Problem 7.35) which gives a somewhat less restrictive condition for irre- 
ducibility and aperiodicity. 

Lemma 7.6. Assume $f$ is bounded and positive on every compact set of its
support $\mathcal{E}$. If there exist positive numbers $\epsilon$ and $\delta$ such that

(7.6) $q(y|x) > \epsilon$ if $|x - y| < \delta$,

then the Metropolis-Hastings Markov chain $(X^{(t)})$ is $f$-irreducible and aperiodic.
Moreover, every nonempty compact set is a small set.

The rationale behind this result is the following. If the conditional distribution
$q(y|x)$ allows for moves in a neighborhood of $x^{(t)}$ with diameter
bounded from below and if $f$ is such that $\rho(x^{(t)}, y)$ is positive in this neighborhood,
then any subset of $\mathcal{E}$ can be visited in $k$ steps for $k$ large enough.
(This property obviously relies on the assumption that $\mathcal{E}$ is connected.)

Proof. Consider $x^{(0)}$ an arbitrary starting point and $A \subset \mathcal{E}$ an arbitrary
measurable set. The connectedness of $\mathcal{E}$ implies that there exist $m \in \mathbb{N}$ and
a sequence $x^{(i)} \in \mathcal{E}$ ($1 \le i \le m$) such that $x^{(m)} \in A$ and $|x^{(i+1)} - x^{(i)}| < \delta$.
It is therefore possible to link $x^{(0)}$ and $A$ through a sequence of balls with
radius $\delta$. The assumptions on $f$ imply that the acceptance probability of
a point $x^{(i)}$ of the $i$th ball starting from the $(i-1)$st ball is positive and,
therefore, $P^m_{x^{(0)}}(A) = P(X^{(m)} \in A \mid X^{(0)} = x^{(0)}) > 0$. By Theorem 6.15, the
$f$-irreducibility of $(X^{(t)})$ is established.

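For an arbitrary value $x^{(0)} \in \mathcal{E}$ and for every $y \in B(x^{(0)}, \delta/2)$ (the ball
with center $x^{(0)}$ and radius $\delta/2$) we have

$P_y(A) \ge \int_A \rho(y, z)\, q(z|y)\, dz$

$= \int_{A \cap D_y} \dfrac{f(z)}{f(y)}\, q(y|z)\, dz + \int_{A \cap D_y^c} q(z|y)\, dz$,

where $D_y = \{z;\ f(z)\, q(y|z) \le f(y)\, q(z|y)\}$. It therefore follows that, with
$B = B(x^{(0)}, \delta/2)$,

$P_y(A) \ge \int_{A \cap D_y \cap B} \dfrac{f(z)}{f(y)}\, q(y|z)\, dz + \int_{A \cap D_y^c \cap B} q(z|y)\, dz$

$\ge \dfrac{\inf_B f(x)}{\sup_B f(x)} \int_{A \cap D_y \cap B} q(y|z)\, dz + \int_{A \cap D_y^c \cap B} q(z|y)\, dz$

$\ge \epsilon\, \dfrac{\inf_B f(x)}{\sup_B f(x)}\, \lambda(A \cap B)$,

where $\lambda$ denotes the Lebesgue measure on $\mathcal{E}$. The balls $B(x^{(0)}, \delta/2)$ are small
sets associated with uniform distributions on $B(x^{(0)}, \delta/2)$. This simultaneously
implies the aperiodicity of $(X^{(t)})$ and the fact that every compact set is
small. □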



Corollary 7.7. The conclusions of Theorem 7.4 hold if the Metropolis-Hastings
Markov chain $(X^{(t)})$ has invariant probability density $f$ and conditional density
$q(x|y)$ that satisfy the assumptions of Lemma 7.6.

One of the most fascinating aspects of the algorithm [A.24] is its universality;
that is, the fact that an arbitrary conditional distribution $q$ with support
$\mathcal{E}$ can lead to the simulation of an arbitrary distribution $f$ on $\mathcal{E}$. On the other
hand, this universality may only hold formally if the instrumental distribution
$q$ rarely simulates points in the main portion of $\mathcal{E}$; that is to say, in the region
where most of the mass of the density $f$ is located. This issue of selecting a
good proposal distribution $q$ for a given $f$ is detailed in Section 7.6.

Since we have provided no examples so far, we now proceed to describe 
two particular approaches used in the literature, with some probabilistic prop- 
erties and corresponding examples. Note that a complete classification of 
the Metropolis-Hastings algorithms is impossible, given the versatility of the 
method and the possibility of creating even more hybrid methods (see, for 
instance, Roberts and Tweedie 1995, 2004 and Stramer and Tweedie 1999b). 



7.4 The Independent Metropolis-Hastings Algorithm 

7.4.1 Fixed Proposals 

This method appears as a straightforward generalization of the Accept-Reject
method in the sense that the instrumental distribution $q$ is independent of
$X^{(t)}$ and is denoted $g$ by analogy. The algorithm [A.24] will then produce the
following transition from $x^{(t)}$ to $X^{(t+1)}$.

Algorithm A.25 -Independent Metropolis-Hastings-

Given $x^{(t)}$,

1. Generate $Y_t \sim g(y)$.

2. Take

$X^{(t+1)} = Y_t$ with probability $\min\left\{ \dfrac{f(Y_t)\, g(x^{(t)})}{f(x^{(t)})\, g(Y_t)},\ 1 \right\}$,
$X^{(t+1)} = x^{(t)}$ otherwise. [A.25]

Although the $Y_t$'s are generated independently, the resulting sample is not
iid: for instance, the probability of acceptance of $Y_t$ depends on $X^{(t)}$ (except
in the trivial case when $f = g$).

The convergence properties of the chain $(X^{(t)})$ follow from properties of
the density $g$ in the sense that $(X^{(t)})$ is irreducible and aperiodic (thus, ergodic
according to Corollary 7.5) if and only if $g$ is almost everywhere positive on the
support of $f$. Stronger properties of convergence like geometric and uniform
ergodicity are also clearly described by the following result of Mengersen and
Tweedie (1996).

Theorem 7.8. The algorithm [A.25] produces a uniformly ergodic chain if
there exists a constant $M$ such that

(7.7) $f(x) \le M g(x)$, $\forall x \in \mathrm{supp}\, f$.

In this case,

(7.8) $\|K^n(x, \cdot) - f\|_{TV} \le 2 \left( 1 - \dfrac{1}{M} \right)^n$,

where $\| \cdot \|_{TV}$ denotes the total variation norm introduced in Definition 6.47.
On the other hand, if for every $M$, there exists a set of positive measure where
(7.7) does not hold, $(X^{(t)})$ is not even geometrically ergodic.
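The bound (7.8) can be checked exactly on a finite state space, where the kernel of [A.25] is a matrix and the total variation norm used here reduces to an $L^1$ distance between probability vectors. The sketch below is a discrete analogue with a hypothetical four-point target `f` and a uniform candidate `g`, so that $M = \max_x f(x)/g(x) = 1.6$; all names are ours.

```python
def independent_mh_kernel(f, g):
    """Transition matrix of the independent Metropolis-Hastings chain."""
    n = len(f)
    kernel = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                kernel[x][y] = g[y] * min(1.0, f[y] * g[x] / (f[x] * g[y]))
        kernel[x][x] = 1.0 - sum(kernel[x])  # proposing x, or rejecting
    return kernel


def l1_distance_after(kernel, f, start, n_steps):
    """sum_y |K^n(start, y) - f(y)|, i.e. 2 sup_A |K^n(start, A) - f(A)|."""
    size = len(f)
    row = [1.0 if y == start else 0.0 for y in range(size)]
    for _ in range(n_steps):
        row = [sum(row[u] * kernel[u][y] for u in range(size))
               for y in range(size)]
    return sum(abs(r - fy) for r, fy in zip(row, f))


f = [0.1, 0.2, 0.3, 0.4]      # hypothetical normalized target
g = [0.25, 0.25, 0.25, 0.25]  # hypothetical candidate
M = max(fx / gx for fx, gx in zip(f, g))
K = independent_mh_kernel(f, g)
bounds = [2.0 * (1.0 - 1.0 / M) ** n for n in range(1, 9)]
distances = [max(l1_distance_after(K, f, x, n) for x in range(len(f)))
             for n in range(1, 9)]
```

In this example the worst-case distance after one step is 0.45, against a bound of 0.75, and the bound continues to hold at every later step. A candidate with lighter tails than the target would make $M$ infinite and the bound vacuous, in line with the second half of the theorem.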

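Proof. If (7.7) is satisfied, the transition kernel satisfies

$K(x, x') \ge g(x') \min\left\{ \dfrac{f(x')\, g(x)}{f(x)\, g(x')},\ 1 \right\} = \min\left\{ f(x')\, \dfrac{g(x)}{f(x)},\ g(x') \right\} \ge \dfrac{1}{M} f(x')$.

The set $\mathcal{E}$ is therefore small and the chain is uniformly ergodic (Theorem
6.59).

To establish the bound on $\|K^n(x, \cdot) - f\|_{TV}$, first write

$\|K(x, \cdot) - f\|_{TV} = 2 \sup_A \left| \int_A (K(x, y) - f(y))\, dy \right|$

$= 2 \int_{\{y;\, f(y) \ge K(x, y)\}} (f(y) - K(x, y))\, dy$

(7.9) $\le 2 \left( 1 - \dfrac{1}{M} \right) \int_{\{y;\, f(y) \ge K(x, y)\}} f(y)\, dy \le 2 \left( 1 - \dfrac{1}{M} \right)$.

We now continue with a recursion argument to establish (7.8). We can write

(7.10) $\int_A (K^2(x, y) - f(y))\, dy = \int_{\mathcal{E}} \left[ \int_A (K(u, y) - f(y))\, dy \right] (K(x, u) - f(u))\, du$,

and an argument like that in (7.9) leads to

(7.11) $\|K^2(x, \cdot) - f\|_{TV} \le 2 \left( 1 - \dfrac{1}{M} \right)^2$.

We next write a general recursion relation

$\int_A (K^{n+1}(x, y) - f(y))\, dy = \int_{\mathcal{E}} \left[ \int_A (K^n(u, y) - f(y))\, dy \right] (K(x, u) - f(u))\, du$,

and proof of (7.8) is established by induction (Problem 7.11).

If (7.7) does not hold, then the sets

$D_n = \left\{ x;\ \dfrac{f(x)}{g(x)} \ge n \right\}$

satisfy $P_f(D_n) > 0$ for every $n$. If $x \in D_n$, then

$P(x, \{x\}) = 1 - E_g\left[ \min\left\{ \dfrac{f(Y)\, g(x)}{g(Y)\, f(x)},\ 1 \right\} \right]$

$= 1 - P_g\left( \dfrac{f(Y)}{g(Y)} \ge \dfrac{f(x)}{g(x)} \right) - E_g\left[ \dfrac{f(Y)\, g(x)}{f(x)\, g(Y)}\, I_{\frac{f(Y)}{g(Y)} < \frac{f(x)}{g(x)}} \right]$

$\ge 1 - P_g\left( \dfrac{f(Y)}{g(Y)} \ge n \right) - \dfrac{g(x)}{f(x)} \ge 1 - \dfrac{2}{n}$,

since Markov inequality implies that

$P_g\left( \dfrac{f(Y)}{g(Y)} \ge n \right) \le \dfrac{1}{n}\, E_g\left[ \dfrac{f(Y)}{g(Y)} \right] = \dfrac{1}{n}$.

Consider a small set $C$ such that $D_n \cap C^c$ is not empty for $n$ large enough
and $x_0 \in D_n \cap C^c$. The return time to $C$, $\tau_C$, satisfies

$P_{x_0}(\tau_C > k) \ge \left( 1 - \dfrac{2}{n} \right)^k$;

therefore, the radius of convergence of the series (in $\kappa$) $E_{x_0}[\kappa^{\tau_C}]$ is smaller than
$n/(n - 2)$ for every $n$, and this implies that $(X^{(t)})$ cannot be geometrically
ergodic, according to Theorem 6.75. □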



This particular class of Metropolis-Hastings algorithms naturally suggests
a comparison with Accept-Reject methods since every pair $(f, g)$ satisfying
(7.7) can also induce an Accept-Reject algorithm. Note first that the expected
acceptance probability for the variable simulated according to $g$ is larger in
the case of the algorithm [A.25].

Lemma 7.9. If (7.7) holds, the expected acceptance probability associated with
the algorithm [A.25] is at least $1/M$ when the chain is stationary.

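Proof. If the distribution of $f(X)/g(X)$ is absolutely continuous,^ the expected
acceptance probability is

$E\left[ \min\left\{ \dfrac{f(Y_t)\, g(X^{(t)})}{f(X^{(t)})\, g(Y_t)},\ 1 \right\} \right] = \int\!\!\int \min\left\{ \dfrac{f(y)\, g(x)}{g(y)\, f(x)},\ 1 \right\} f(x)\, g(y)\, dx\, dy$

$= 2 \int\!\!\int I_{\frac{f(y) g(x)}{g(y) f(x)} \ge 1}\ f(x)\, g(y)\, dx\, dy$

$\ge 2 \int\!\!\int I_{\frac{f(y)}{g(y)} \ge \frac{f(x)}{g(x)}}\ f(x)\, \dfrac{f(y)}{M}\, dx\, dy$

$= \dfrac{2}{M}\, P\left( \dfrac{f(X_1)}{g(X_1)} \ge \dfrac{f(X_2)}{g(X_2)} \right) = \dfrac{1}{M}$.

Since $X_1$ and $X_2$ are independent and distributed according to $f$, this last
probability is equal to 1/2, and the result follows. □

^ This constraint implies that $f/g$ is not constant over some interval.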
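The inequality of Lemma 7.9 is easy to verify on a finite state space, where the stationary acceptance rate is a double sum. This is a discrete analogue only (ties $x = y$ have positive probability, so the absolute continuity assumption of the proof does not hold), but the bound still applies because $\min\{f(x) g(y), f(y) g(x)\} \ge f(x) f(y)/M$ term by term. The target and candidate below are hypothetical, as is the function name.

```python
def expected_acceptance(f, g):
    """Stationary acceptance rate of independent MH: X ~ f, Y ~ g independent."""
    n = len(f)
    return sum(f[x] * g[y] * min(1.0, f[y] * g[x] / (f[x] * g[y]))
               for x in range(n) for y in range(n))


f = [0.1, 0.2, 0.3, 0.4]      # hypothetical normalized target
g = [0.25, 0.25, 0.25, 0.25]  # hypothetical candidate
M = max(fx / gx for fx, gx in zip(f, g))
mh_rate = expected_acceptance(f, g)
ar_rate = 1.0 / M  # acceptance rate of the Accept-Reject algorithm
```

For this pair the Metropolis-Hastings chain accepts 75% of the proposals at stationarity, compared with 62.5% for the Accept-Reject algorithm built on the same bound $M = 1.6$.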

Thus, the independent Metropolis-Hastings algorithm [A.25] is more ef- 
ficient than the Accept-Reject algorithm [A.4] in its handling of the sample 
produced by g, since, on the average, it accepts more proposed values. A 
more advanced comparison between these approaches is about as difficult as 
the comparison between Accept-Reject and importance sampling proposed in 
Section 3.3.3, namely that the size of one of the two samples is random and 
this complicates the computation of the variance of the resulting estimator. 
In addition, the correlation between the $X^{(t)}$'s resulting from [A.25] prohibits a
closed-form expression of the joint distribution. We therefore study the con- 
sequence of the correlation on the variance of both estimators through an 
example. (See also Liu 1996b and Problem 7.33 for a comparison based on 
the eigenvalues of the transition operators in the discrete case, which also 
shows the advantage of the Metropolis-Hastings algorithm.) 

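Example 7.10. Generating gamma variables. Using the algorithm of
Example 2.19 (see also Example 3.15), an Accept-Reject method can be derived
to generate random variables from the $\mathcal{G}a(\alpha, \beta)$ distribution using a
Gamma $\mathcal{G}a(\lfloor\alpha\rfloor, b)$ candidate (where $\lfloor\alpha\rfloor$ denotes the integer part of $\alpha$). When
$\beta = 1$, the optimal choice of $b$ is

$b = \lfloor\alpha\rfloor / \alpha$.

The algorithms to compare are then

Algorithm A.26 -Gamma Metropolis-Hastings-

1. Generate $Y_t \sim \mathcal{G}a(\lfloor\alpha\rfloor, \lfloor\alpha\rfloor/\alpha)$.

2. Take $X^{(t+1)} = Y_t$ with probability $\rho_t$ and $X^{(t+1)} = x^{(t)}$ otherwise, where

$\rho_t = \min\left[ \left( \dfrac{Y_t}{x^{(t)}} \exp\left\{ \dfrac{x^{(t)} - Y_t}{\alpha} \right\} \right)^{\alpha - \lfloor\alpha\rfloor},\ 1 \right]$. [A.26]

and

Algorithm A.27 -Gamma Accept-Reject-

1. Generate $Y \sim \mathcal{G}a(\lfloor\alpha\rfloor, \lfloor\alpha\rfloor/\alpha)$.

2. Accept $X = Y$ with probability $\left( \dfrac{e\, y \exp(-y/\alpha)}{\alpha} \right)^{\alpha - \lfloor\alpha\rfloor}$. [A.27]

Note that (7.7) does apply in this particular case with $\exp(x/\alpha)/x > e/\alpha$.

Fig. 7.1. Convergence of Accept-Reject (solid line) and Metropolis-Hastings
(dashed line) estimators to $E_f[X^2] = 8.33$, for $\alpha = 2.43$, based on the same sequence
$y_1, \ldots, y_{5000}$ simulated from $\mathcal{G}a(2, 2/2.43)$. The number of acceptances in
[A.27] is then random. The final values of the estimators are 8.25 for [A.27] and 8.32
for [A.26].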

A first comparison is based on a sample $(y_1, \ldots, y_n)$, of fixed size $n$, generated
from $\mathcal{G}a(\lfloor\alpha\rfloor, \lfloor\alpha\rfloor/\alpha)$ with $x^{(0)}$ generated from $\mathcal{G}a(\alpha, 1)$. The number
$t$ of values accepted by [A.27] is then random. Figure 7.1 describes the convergence
of the estimators of $E_f[X^2]$ associated with both algorithms for the
same sequence of $y_i$'s and exhibits strong agreement between the approaches,
with the estimator based on [A.26] being closer to the exact value 8.33 in this
case.

Fig. 7.2. Convergence to $E_f[X^2] = 8.33$ of Accept-Reject (full line) and
Metropolis-Hastings (dots) estimators for 10,000 acceptances in [A.27], the same
sequence of $y_i$'s simulated from $\mathcal{G}a(2, 2/2.43)$ being used in [A.27] and [A.26]. The
final values of the estimators are 8.20 for [A.27] and 8.21 for [A.26].

On the other hand, the number $t$ of values accepted by [A.27] can be fixed
and [A.26] can then use the resulting sample of random size $n$, $y_1, \ldots, y_n$.
Figure 7.2 reproduces the comparison in this second case and exhibits a behavior
rather similar to Figure 7.1, with another close agreement between estimators
and, the scale being different, a smaller variance (which is due to the larger
size of the effective sample).

Note, however, that both comparisons are biased. In the first case, the
sample of $X^{(i)}$'s produced by [A.27] does not have the distribution $f$ and, in
the second case, the sample of $y_i$'s in [A.26] is not iid. In both cases, this is due
to the use of a stopping rule which modifies the distribution of the samples. ||
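The fixed-$n$ comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Metropolis-Hastings acceptance ratio is derived here from the generic independent-proposal ratio $f(y)g(x)/\{f(x)g(y)\}$ for the $\mathcal{G}a(\alpha,1)$ target, and the function name `gamma_ar_vs_mh`, the seed, and the use of the standard `random` module are choices made for this sketch, not part of the original experiment.

```python
import math
import random

def gamma_ar_vs_mh(alpha=2.43, n=5000, seed=0):
    """Estimate E[X^2] for Ga(alpha, 1) from one shared candidate sequence."""
    rng = random.Random(seed)
    a = math.floor(alpha)      # integer part of alpha
    frac = alpha - a           # exponent alpha - [alpha]
    scale = alpha / a          # Ga([alpha], rate [alpha]/alpha) has scale alpha/[alpha]
    ys = [rng.gammavariate(a, scale) for _ in range(n)]

    # Accept-Reject: keep y with probability (e * y * exp(-y/alpha) / alpha)^frac
    accepted = [y for y in ys
                if rng.random() < (math.e * y * math.exp(-y / alpha) / alpha) ** frac]
    ar_estimate = sum(y * y for y in accepted) / len(accepted)

    # Independent Metropolis-Hastings run on the same candidates
    x = rng.gammavariate(alpha, 1.0)   # x^(0) ~ Ga(alpha, 1)
    total = 0.0
    for y in ys:
        rho = min(1.0, ((y / x) * math.exp((x - y) / alpha)) ** frac)
        if rng.random() < rho:
            x = y
        total += x * x
    mh_estimate = total / n
    return ar_estimate, mh_estimate, len(accepted)
```

Both estimates should land near $\alpha(\alpha+1) \approx 8.33$ for $\alpha = 2.43$, while the third returned value (the number of Accept-Reject acceptances) varies from run to run, which is the source of the bias discussed above.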



Example 7.11. Logistic Regression. We return to the data of Example
1.13, which described a logistic regression relating the failure of O-rings in
shuttle flights to air temperature. We observe $(x_i, y_i)$, $i = 1, \ldots, n$, according
to the model

\[ Y_i \sim \text{Bernoulli}(p(x_i)), \qquad p(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}, \]

where $p(x)$ is the probability of an O-ring failure at temperature $x$. The
likelihood is

\[ L(\alpha,\beta|\mathbf{y}) \propto \prod_{i=1}^{n} \left( \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)} \right)^{y_i} \left( \frac{1}{1 + \exp(\alpha + \beta x_i)} \right)^{1-y_i} \]

and we take the prior to be

\[ \pi_\alpha(\alpha|b)\,\pi_\beta(\beta) = \frac{1}{b}\, e^{\alpha} e^{-e^{\alpha}/b}\, d\alpha\, d\beta, \]

which puts an exponential prior on $e^{\alpha}$ and a flat prior on $\beta$, and ensures
propriety of the posterior distribution (Problem 7.25). To complete the prior
specification we must give a value for $b$, and we choose the data-dependent
value that makes $E[\alpha] = \hat\alpha$, where $\hat\alpha$ is the MLE of $\alpha$. (This also ensures that the
prior will not have undue influence, as it is now centered near the likelihood.)
It can be shown that

\[ E[\alpha] = \int_{-\infty}^{\infty} \alpha\, \frac{1}{b}\, e^{\alpha} e^{-e^{\alpha}/b}\, d\alpha = \int_{0}^{\infty} \log(w)\, \frac{1}{b}\, e^{-w/b}\, dw = \log(b) - \gamma, \]

where $\gamma$ is Euler's Constant, equal to .577216. Thus we take $\hat b = e^{\hat\alpha + \gamma}$.

The posterior distribution is proportional to $L(\alpha,\beta|\mathbf{y})\,\pi(\alpha,\beta)$, and to
simulate from this distribution we take an independent candidate

\[ g(\alpha,\beta) = \pi_\alpha(\alpha|\hat b)\,\phi(\beta), \]

where $\phi(\beta)$ is a normal distribution with mean $\hat\beta$ and variance $\hat\sigma_\beta^2$, the MLEs.
Note that although basing the prior distribution on the data is somewhat in
violation of the formal Bayesian paradigm, nothing is violated if the
candidate depends on the data. In fact, this will usually result in a more effective
simulation, as the candidate is placed close to the target.

Generating a random variable from $g(\alpha,\beta)$ is straightforward, as it only
involves the generation of a normal and an exponential random variable. If we
are at the point $(\alpha_0,\beta_0)$ in the Markov chain, and we generate $(\alpha',\beta')$ from
$g(\alpha,\beta)$, we accept the candidate with probability

\[ \min\left\{ \frac{L(\alpha',\beta'|\mathbf{y})\,\phi(\beta_0)}{L(\alpha_0,\beta_0|\mathbf{y})\,\phi(\beta')},\, 1 \right\}. \]

Figure 7.3 shows the distribution of the generated parameters and their
convergence. ||



Example 7.12. Saddlepoint tail area approximation. In Example 3.18, 
we saw an approximation to noncentral chi squared tail areas based on the 
regular and renormalized saddlepoint approximations. Such an approximation 
requires numerical integration, both to calculate the constant and to evaluate 
the tail area. 

Fig. 7.3. Estimation of the slope and intercept from the Challenger logistic
regression. The top panels show histograms of the distribution of the coefficients, while
the bottom panels show the convergence of the means.

An alternative is to produce a sample $Z_1, \ldots, Z_m$ from the saddlepoint
distribution, and then approximate the tail area using

\[ P(\bar X > a) = \int_{\hat\tau(a)}^{1/2} \left( \frac{n}{2\pi} \right)^{1/2} [K_X''(t)]^{1/2} \exp\{ n [K_X(t) - t K_X'(t)] \}\, dt \approx \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}[Z_i > \hat\tau(a)], \qquad (7.13) \]

where $K_X(\tau)$ is the cumulant generating function of $X$ and $\hat\tau(x)$ is the solution
of the saddlepoint equation $K'(\hat\tau(x)) = x$ (see Section 3.6.2).

Note that we are simulating from the transformed density. It is interesting
(and useful) to note that we can easily derive an instrumental density to use
in a Metropolis-Hastings algorithm. Using a Taylor series approximation, we
find that

\[ \exp\{ n [K_X(t) - t K_X'(t)] \} \approx \exp\left\{ -n K_X''(0)\, \frac{t^2}{2} \right\}, \qquad (7.14) \]

so a first choice for an instrumental density is the $\mathcal{N}(0, 1/nK_X''(0))$
distribution (see Problem 7.26 for details). Booth et al. (1999) use a Student's t
approximation instead.
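The step behind (7.14) can be spelled out as follows. This is a sketch under the usual assumptions that $K_X$ is twice differentiable near 0 and that $K_X(0) = 0$ (as for any cumulant generating function); the first-order terms cancel and only the quadratic term survives.

```latex
\begin{align*}
K_X(t)      &\approx K_X'(0)\,t + \tfrac{1}{2}\,K_X''(0)\,t^2, \\
t\,K_X'(t)  &\approx K_X'(0)\,t + K_X''(0)\,t^2, \\
n\,[K_X(t) - t\,K_X'(t)] &\approx -\tfrac{n}{2}\,K_X''(0)\,t^2 .
\end{align*}
```

The right-hand side is the log-kernel of a normal density in $t$ with mean 0 and variance $1/\{nK_X''(0)\}$, which is the instrumental distribution named above.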

We can now simulate the noncentral chi squared tail areas using a normal
instrumental density with $K_X''(t) = 2[p(1 - 2t) + 4\lambda]/(1 - 2t)^3$. The results
are presented in Table 7.1, where we see that the approximations are quite
good. Note that the same set of simulated random variables can be used for
all the tail area probability calculations. Moreover, by using the
Metropolis-Hastings algorithm, we have avoided calculating the normalizing constant for
the saddlepoint approximation. ||



Interval          Renormalized   Exact   Monte Carlo
(36.225, ∞)       0.0996         0.1     0.0992
(40.542, ∞)       0.0497         0.05    0.0497
(49.333, ∞)       0.0099         0.01    0.0098

Table 7.1. Monte Carlo saddlepoint approximation of a noncentral chi squared
integral for $p = 6$ and $\lambda = 9$, based on 10,000 simulated random variables.



As an aside, note that the usual classification of "Hastings" for the
algorithm [A.25] is somewhat inappropriate, since Hastings (1970) considers the
algorithm [A.24] in general, using random walks (Section 7.5) rather than
independent distributions in his examples. It is also interesting to recall that
Hastings (1970) proposes a theoretical justification of these methods for finite
state-space Markov chains based on the finite representation of real numbers
in a computer. However, a complete justification of this physical
discretization needs to take into account the effect of the approximation in the entire
analysis. In particular, it needs to be verified that the computer choice of
discrete approximation to the continuous distribution has no effect on the
resulting stationary distribution or irreducibility of the chain. Since Hastings
(1970) does not go into such detail, but keeps to the simulation level, we
prefer to study the theoretical properties of these algorithms by bypassing
the finite representation of numbers in a computer and by assuming flawless
pseudo-random generators, namely algorithms producing variables which are
uniformly distributed on [0, 1]. See Roberts et al. (1995) for a theoretical study
of some effects of the computer discretization.

A final note about independent Metropolis-Hastings algorithms is that
they cannot be omniscient: there are settings where an independent proposal
does not work well because of the complexity of the target distribution. Since
the main purpose of MCMC algorithms is to provide a crude but easy
simulation technique, it is difficult to imagine spending a long time on the design
of the proposal distribution. This is especially pertinent in high-dimensional
models where the capture of the main features of the target distribution is
most often impossible. There is therefore a limitation of the independent
proposal, which can be perceived as a global proposal, and a need to use more
local proposals that are not so sensitive to the target distribution, as presented
in Section 7.5. Another possibility, developed in Section 7.6.3, is to validate
adaptive algorithms that learn from the ongoing performances of the current
proposals to refine their construction. But this solution is delicate, both from
a theoretical ("Does ergodicity apply?") and an algorithmic ("How does one
tune the adaptation?") point of view. The following section first develops a
specific kind of adaptive algorithm.

7.4.2 A Metropolis-Hastings Version of ARS 

The ARS algorithm, which provides a general Accept-Reject method for
log-concave densities in dimension one (see Section 2.4.2), can be generalized to
the ARMS method (which stands for Adaptive Rejection Metropolis Sampling)
following the approach developed by Gilks et al. (1995). This generalization
applies to the simulation of arbitrary densities, instead of being restricted
to log-concave densities as the ARS algorithm, by simply adapting the ARS
algorithm for densities $f$ that are not log-concave. The algorithm progressively
fits a function $g_n$ which plays the role of a pseudo-envelope of the density
$f$. In general, this function $g_n$ does not provide an upper bound on $f$, but
the introduction of a Metropolis-Hastings step in the algorithm justifies the
procedure.

Using the notation from Section 2.4.2, take $h(x) = \log f_1(x)$ with $f_1$
proportional to the density $f$. For a sample $S_n = \{x_i,\ 0 \le i \le n+1\}$, the
equations of the lines between $(x_i, h(x_i))$ and $(x_{i+1}, h(x_{i+1}))$ are denoted by
$y = L_{i,i+1}(x)$. Consider

\[ h_n(x) = \max\{ L_{i,i+1}(x),\ \min[ L_{i-1,i}(x), L_{i+1,i+2}(x) ] \}, \]

for $x_i \le x < x_{i+1}$, with

\[ h_n(x) = \begin{cases}
L_{0,1}(x) & \text{if } x < x_0, \\
\max[ L_{0,1}(x), L_{1,2}(x) ] & \text{if } x_0 \le x < x_1, \\
\max[ L_{n,n+1}(x), L_{n-1,n}(x) ] & \text{if } x_n \le x < x_{n+1}, \\
L_{n,n+1}(x) & \text{if } x \ge x_{n+1}.
\end{cases} \]
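The piecewise definition of $h_n$ above translates directly into code. The sketch below is illustrative only: the helper names `line` and `make_hn` are hypothetical, the abscissae are assumed sorted and to contain at least three points, and the function returns $h_n$ itself rather than the normalized proposal $g_n \propto \exp\{h_n\}$.

```python
def line(p, q):
    """Return the chord through points p and q as a function of x."""
    (x0, y0), (x1, y1) = p, q
    slope = (y1 - y0) / (x1 - x0)
    return lambda x: y0 + slope * (x - x0)

def make_hn(xs, h):
    """Build h_n from sorted abscissae xs = [x_0, ..., x_{n+1}] and log-density h."""
    pts = [(x, h(x)) for x in xs]
    m = len(xs) - 1                                   # m = n + 1, index of the last point
    L = [line(pts[i], pts[i + 1]) for i in range(m)]  # L[i] plays the role of L_{i,i+1}

    def hn(x):
        if x < xs[0]:
            return L[0](x)
        if x >= xs[m]:
            return L[m - 1](x)
        # locate i with x_i <= x < x_{i+1}
        i = max(k for k in range(m) if xs[k] <= x)
        if i == 0:
            return max(L[0](x), L[1](x))
        if i == m - 1:
            return max(L[m - 1](x), L[m - 2](x))
        return max(L[i](x), min(L[i - 1](x), L[i + 1](x)))

    return hn
```

For a log-concave $h$ the resulting $h_n$ lies above $h$ everywhere, which recovers the ARS envelope; for a general $h$ it can fall below, which is why the Metropolis-Hastings correction is needed.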



The resulting proposal distribution is $g_n(x) \propto \exp\{h_n(x)\}$. The ARMS
algorithm is based on $g_n$ and it can be decomposed into two parts, a first step
which is a standard Accept-Reject step for the simulation from the
instrumental distribution

\[ \psi_n(x) \propto \min\left[ f_1(x), \exp\{h_n(x)\} \right], \]

based on $g_n$, and a second part, which is the acceptance of the simulated value
by a Metropolis-Hastings procedure:

Algorithm A.28 -ARMS Metropolis-Hastings-

1. Simulate $Y$ from $g_n(y)$ and $U \sim \mathcal{U}_{[0,1]}$
   until
   $U \le f_1(Y)/\exp\{h_n(Y)\}$.

2. Generate $V \sim \mathcal{U}_{[0,1]}$ and take                                [A.28]

\[ X^{(t+1)} = \begin{cases}
Y & \text{if } V < \dfrac{f_1(Y)\,\psi_n(x^{(t)})}{f_1(x^{(t)})\,\psi_n(Y)}, \\[2ex]
x^{(t)} & \text{otherwise.}
\end{cases} \qquad (7.15) \]

The Accept-Reject step indeed produces a variable distributed from $\psi_n(x)$
and this justifies the expression of the acceptance probability in the
Metropolis-Hastings step. Note that [A.28] is a particular case of the approximate
Accept-Reject algorithms considered by Tierney (1994) (see Problem 7.9).
The probability (7.15) can also be written

\[ \begin{cases}
\min\left\{ 1,\ \dfrac{f_1(Y)\exp\{h_n(x^{(t)})\}}{f_1(x^{(t)})\exp\{h_n(Y)\}} \right\} & \text{if } f_1(Y) > \exp\{h_n(Y)\}, \\[2ex]
\min\left\{ 1,\ \dfrac{\exp\{h_n(x^{(t)})\}}{f_1(x^{(t)})} \right\} & \text{otherwise,}
\end{cases} \]

which implies a sure acceptance of $Y$ when $f_1(x^{(t)}) < \exp\{h_n(x^{(t)})\}$; that is,
when the bound is correct.

Each simulation of $Y \sim g_n$ in Step 1 of [A.28] provides, in addition, an
update of $S_n$ in $S_{n+1} = S_n \cup \{y\}$, and therefore of $g_n$, when $Y$ is rejected. As
in the case of the ARS algorithm, the initial $S_n$ set must be chosen so that $g_n$
is truly a probability density. If the support of $f$ is not bounded from below,
$L_{0,1}$ must be increasing and, similarly, if the support of $f$ is not bounded from
above, $L_{n,n+1}$ must be decreasing. Note also that the simulation of $g_n$ detailed
in Section 2.4.2 is valid in this setting.

Since the algorithm [A.28] appears to be a particular case of
independent Metropolis-Hastings algorithm, the convergence and ergodicity results
obtained in Section 7.4 should apply for [A.28]. This is not the case,
however, because of the lack of time homogeneity of the chain (see Definition 6.4)
produced by [A.28]. The transition kernel, based on $g_n$, can change at each
step with a positive probability. Since the study of nonhomogeneous chains is
quite delicate, the algorithm [A.28] can be justified only by reverting to the
homogeneous case; that is, by fixing the function $g_n$ and the set $S_n$ after a
warm-up period of length $n_0$. The constant $n_0$ need not be fixed in advance
as this warm-up period can conclude when the approximation of $f_1$ by $g_n$ is
satisfactory, for instance when the rejection rate in Step 1 of [A.28] is
sufficiently small. The algorithm [A.28] must then start with an initializing (or
calibrating) step which adapts the parameters at hand (in this case, $g_n$) to
the function $f_1$. This adaptive structure is generalized in Section 7.6.3.

The ARMS algorithm is useful when a precise analytical study of the
density $f$ is impossible, as, for instance, in the setup of generalized linear
models. In fact, $f$ (or $f_1$) needs to be computed in only a few points to
initialize the algorithm, which thus does not require the search for a "good"
density $g$ which approximates $f$. This feature should be contrasted to the
cases of the independent Metropolis-Hastings algorithm and of sufficiently
fast random walks as in the case of [A.29].



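Example 7.13. Poisson logistic model. For the generalized linear model
in Example 2.26, consider a logit dependence between explanatory and
dependent (observations) variables,

\[ Y_i \mid x_i \sim \mathcal{P}\left( \frac{\exp(b x_i)}{1 + \exp(b x_i)} \right), \qquad i = 1, \ldots, n, \]

which implies the restriction $\lambda_i < 1$ on the parameters of the Poisson
distribution, $Y_i \sim \mathcal{P}(\lambda_i)$. When $b$ has the prior distribution $\mathcal{N}(0, \tau^2)$, the posterior
distribution is

\[ \pi(b \mid \mathbf{x}, \mathbf{y}) \propto \frac{\exp\left\{ b \sum_i y_i x_i \right\}}{\prod_i (1 + \exp(b x_i))^{y_i}}\, \exp\left\{ -\sum_i \frac{\exp(b x_i)}{1 + \exp(b x_i)} - \frac{b^2}{2\tau^2} \right\}. \]

This posterior distribution $\pi(b \mid \mathbf{x}, \mathbf{y})$ is not easy to simulate from and one can
use the ARMS Metropolis-Hastings algorithm instead. ||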
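To feed this posterior to ARMS (or to any other Metropolis-Hastings sampler) only the unnormalized log-density is needed. A minimal sketch follows, assuming the Poisson likelihood with $\lambda_i = e^{bx_i}/(1+e^{bx_i})$ and the $\mathcal{N}(0,\tau^2)$ prior stated above; the function name `log_posterior` and the default value of `tau` are hypothetical.

```python
import math

def log_posterior(b, xs, ys, tau=10.0):
    """Unnormalized log posterior of b for the Poisson logistic model."""
    total = -b * b / (2.0 * tau * tau)           # N(0, tau^2) prior, up to a constant
    for x, y in zip(xs, ys):
        lam = 1.0 / (1.0 + math.exp(-b * x))     # lambda_i, always in (0, 1)
        total += y * math.log(lam) - lam         # Poisson log-likelihood, up to log(y!)
    return total
```

The returned value plays the role of $h(x) = \log f_1(x)$ in the construction of $h_n$.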



7.5 Random Walks 

A natural approach for the practical construction of a Metropolis-Hastings 
algorithm is to take into account the value previously simulated to generate the 
following value; that is, to consider a local exploration of the neighborhood of 
the current value of the Markov chain. This idea is already used in algorithms 
such as the simulated annealing algorithm [A.19] and the stochastic gradient
method given in (5.4).

Since the candidate $q$ in algorithm [A.24] is allowed to depend on the
current state $X^{(t)}$, a first choice to consider is to simulate $Y_t$ according to

\[ Y_t = X^{(t)} + \varepsilon_t, \]

where $\varepsilon_t$ is a random perturbation with distribution $g$, independent of $X^{(t)}$. In
terms of the algorithm [A.24], $q(y|x)$ is now of the form $g(y - x)$. The Markov
chain associated with $q$ is a random walk (see Example 6.39) on $\mathcal{E}$.

δ           0.1      0.5      1.0
Mean        0.399    -0.111   0.10
Variance    0.698    1.11     1.06

Table 7.2. Estimators of the mean and the variance of a normal distribution $\mathcal{N}(0, 1)$
based on a sample obtained by a Metropolis-Hastings algorithm using a random walk
on $[-\delta, \delta]$ (15,000 simulations).



The convergence results of Section 7.3.2 naturally apply in this particular
case. Following Lemma 7.6, if $g$ is positive in a neighborhood of 0, the chain
$(X^{(t)})$ is $f$-irreducible and aperiodic, therefore ergodic. The most common
distributions $g$ in this setup are the uniform distributions on spheres centered
at the origin or standard distributions like the normal and the Student's t
distributions. All these distributions usually need to be scaled; we discuss this
problem in Section 7.6. At this point, we note that the choice of a symmetric
function $g$ (that is, such that $g(-t) = g(t)$) leads to the following original
expression of [A.24], as proposed by Metropolis et al. (1953).

Algorithm A.29 -Random walk Metropolis-Hastings-

Given $x^{(t)}$,

1. Generate $Y_t \sim g(y - x^{(t)})$.

2. Take                                                              [A.29]

\[ X^{(t+1)} = \begin{cases}
Y_t & \text{with probability } \min\left\{ 1,\ \dfrac{f(Y_t)}{f(x^{(t)})} \right\}, \\[2ex]
x^{(t)} & \text{otherwise.}
\end{cases} \]




Example 7.14. A random walk normal generator. Hastings (1970)
considers the generation of the normal distribution $\mathcal{N}(0, 1)$ based on the
uniform distribution on $[-\delta, \delta]$. The probability of acceptance is then
$\rho(x^{(t)}, y_t) = \exp\{ (x^{(t)2} - y_t^2)/2 \} \wedge 1$. Figure 7.4 describes three samples of 15,000 points
produced by this method for $\delta = 0.1$, 0.5, and 1. The corresponding estimates
of the mean and variance are provided in Table 7.2. Figure 7.4 clearly shows
the different speeds of convergence of the averages associated with these three
values of $\delta$, with an increasing regularity (in $\delta$) of the corresponding histograms
and a faster exploration of the support of $f$. ||
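The experiment of Example 7.14 is easy to reproduce. The following sketch implements [A.29] for the $\mathcal{N}(0,1)$ target with a uniform perturbation on $[-\delta,\delta]$; the function name, the starting value, and the seed are assumptions of this illustration, so the numbers it produces will not match Table 7.2 digit for digit.

```python
import math
import random

def rw_metropolis_normal(delta, n=15000, x0=0.0, seed=1):
    """Random walk Metropolis-Hastings for N(0, 1) with uniform steps on [-delta, delta]."""
    rng = random.Random(seed)
    x = x0
    chain = []
    accepted = 0
    for _ in range(n):
        y = x + rng.uniform(-delta, delta)
        # acceptance probability exp{(x^2 - y^2)/2} capped at 1
        if rng.random() < math.exp(min(0.0, (x * x - y * y) / 2.0)):
            x = y
            accepted += 1
        chain.append(x)
    mean = sum(chain) / n
    var = sum((v - mean) ** 2 for v in chain) / n
    return mean, var, accepted / n
```

Calling the function with `delta` set to 0.1, 0.5, and 1.0 shows the pattern described above: the smallest scale accepts almost every move yet gives the poorest estimates of the mean and variance.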

Despite its simplicity and its natural features, the random walk
Metropolis-Hastings algorithm does not enjoy uniform ergodicity properties. Mengersen
and Tweedie (1996) have shown that in the case where supp $f = \mathbb{R}$, this
algorithm cannot produce a uniformly ergodic Markov chain on $\mathbb{R}$ (Problem 7.16).
This is a rather unsurprising feature when considering the local character of
the random walk proposal, centered at the current value of the Markov chain.

Fig. 7.4. Histograms of three samples produced by the algorithm [A.29] for a
random walk on $[-\delta, \delta]$ with (a) $\delta = 0.1$, (b) $\delta = 0.5$, and (c) $\delta = 1.0$, with the
convergence of the means (7.1), superimposed with scales on the right of the graphs
(15,000 simulations).

Although uniform ergodicity cannot be obtained with random walk
Metropolis-Hastings algorithms, it is possible to derive necessary and sufficient
conditions for geometric ergodicity. Mengersen and Tweedie (1996) have
proposed a condition based on the log-concavity of $f$ in the tails; that is, if there
exist $\alpha > 0$ and $x_1$ such that

\[ \log f(x) - \log f(y) \ge \alpha |y - x| \qquad (7.16) \]

for $y < x < -x_1$ or $x_1 < x < y$.

Theorem 7.15. Consider a symmetric density $f$ which is log-concave with
associated constant $\alpha$ in (7.16) for $|x|$ large enough. If the density $g$ is positive
and symmetric, the chain $(X^{(t)})$ of [A.29] is geometrically ergodic. If $f$ is
not symmetric, a sufficient condition for geometric ergodicity is that $g(t)$ be
bounded by $b \exp\{-\alpha |t|\}$ for a sufficiently large constant $b$.

The proof of this result is based on the use of the drift function $V(x) =
\exp\{\alpha |x|/2\}$ (see Note 6.9.1) and the verification of a geometric drift condition
of the form

\[ \Delta V(x) \le -\lambda V(x) + b\, \mathbb{I}_{[-x^*, x^*]}(x), \qquad (7.17) \]

for a suitable bound $x^*$. Mengersen and Tweedie (1996) have shown, in
addition, that this condition on $g$ is also necessary in the sense that if $(X^{(t)})$ is
geometrically ergodic, there exists $s > 0$ such that

\[ \int e^{s|x|} f(x)\, dx < \infty . \qquad (7.18) \]

Fig. 7.5. 90% confidence envelopes of the means produced by the random walk
Metropolis-Hastings algorithm [A.24] based on an instrumental distribution $\mathcal{N}(0, 1)$
for the generation of (a) a normal distribution $\mathcal{N}(0, 1)$ and (b) a distribution with
density $\psi$. These envelopes are derived from 500 parallel independent chains and
with identical uniform samples on both distributions.



Example 7.16. A comparison of tail effects. In order to assess the
practical effect of this theorem, Mengersen and Tweedie (1996) considered two
random walk Metropolis-Hastings algorithms based on a $\mathcal{N}(0, 1)$
instrumental distribution for the generation of (a) a $\mathcal{N}(0, 1)$ distribution and (b) a
distribution with density $\psi(x) \propto (1 + |x|)^{-3}$. Applying Theorem 7.15 (see
Problem 7.18), it can be shown that the first chain is geometrically
ergodic, whereas the second chain is not. Figures 7.5(a) and 7.5(b) represent
the average behavior of the sums

\[ \frac{1}{T} \sum_{t=1}^{T} X^{(t)} \]

over 500 chains initialized at $x^{(0)} = 0$. The 5% and 95% quantiles of these
chains show a larger variability of the chain associated with the distribution $\psi$,
in terms of both width of the confidence region and precision of the resulting
estimators. ||

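We next look at a discrete example where Algorithm [A.29] generates a
geometrically ergodic chain.

Example 7.17. Random walk geometric generation. Consider
generating a geometric^ distribution, $\mathcal{G}eo(\theta)$, using [A.29] with $(Y^{(t)})$ having transition
probabilities $q(i,j) = P(Y^{(t+1)} = j \mid Y^{(t)} = i)$ given by

\[ q(i,j) = \begin{cases}
1/2 & i = j-1, j+1 \text{ and } j = 1, 2, 3, \ldots \\
1/2 & i = 0, 1 \text{ and } j = 0 \\
0 & \text{otherwise;}
\end{cases} \]

that is, $q$ is the transition kernel of a symmetric random walk on the
non-negative integers with reflecting boundary at 0.

^ The material used in the current example refers to the drift condition introduced
in Note 6.9.1.

Now, $X \sim \mathcal{G}eo(\theta)$ implies $P(X = x) = (1 - \theta)^x \theta$ for $x = 0, 1, 2, \ldots$. The
transition matrix has a band diagonal structure and is given by

\[ T = \begin{pmatrix}
\frac{1+\theta}{2} & \frac{1-\theta}{2} & 0 & 0 & \cdots \\
\frac{1}{2} & \frac{\theta}{2} & \frac{1-\theta}{2} & 0 & \cdots \\
0 & \frac{1}{2} & \frac{\theta}{2} & \frac{1-\theta}{2} & \cdots \\
\vdots & & \ddots & \ddots & \ddots
\end{pmatrix}. \]

Consider the potential function $V(i) = \beta^i$ where $\beta > 1$, and recall that
$\Delta V(y^{(0)}) = E[V(Y^{(1)}) \mid y^{(0)}] - V(y^{(0)})$. For $i > 0$, we have

\[ E[V(Y^{(1)}) \mid Y^{(0)} = i] = \frac{1}{2}\beta^{i-1} + \frac{\theta}{2}\beta^{i} + \frac{1-\theta}{2}\beta^{i+1}
= V(i) \left( \frac{1}{2\beta} + \frac{\theta}{2} + \frac{\beta(1-\theta)}{2} \right). \]

Thus, $\Delta V(i) = V(i)\,(1/(2\beta) + \theta/2 - 1 + \beta(1 - \theta)/2) = V(i)\,g(\theta,\beta)$. For a
fixed value of $\theta$, $g(\theta,\beta)$ is minimized by $\beta = 1/\sqrt{1 - \theta}$. In this case, $\Delta V(i) =
(\sqrt{1 - \theta} + \theta/2 - 1)\,V(i)$ and $\lambda = \sqrt{1 - \theta} + \theta/2 - 1$ is the geometric rate of
convergence. The closer $\theta$ is to 1, the faster the convergence. ||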
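The minimization over $\beta$ in Example 7.17 can be checked numerically. The sketch below evaluates $g(\theta,\beta) = 1/(2\beta) + \theta/2 - 1 + \beta(1-\theta)/2$ and the claimed minimizer $\beta = 1/\sqrt{1-\theta}$; the function names are hypothetical and the check is only a sanity test, not a proof.

```python
import math

def drift_factor(theta, beta):
    """g(theta, beta) = 1/(2 beta) + theta/2 - 1 + beta (1 - theta)/2."""
    return 1.0 / (2.0 * beta) + theta / 2.0 - 1.0 + beta * (1.0 - theta) / 2.0

def best_rate(theta):
    """Return the minimizing beta and the resulting drift factor for a given theta."""
    beta = 1.0 / math.sqrt(1.0 - theta)
    return beta, drift_factor(theta, beta)
```

Evaluating `best_rate` for increasing values of $\theta$ in $(0,1)$ gives increasingly negative factors, in line with the remark that convergence is faster when $\theta$ is close to 1.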

Tierney (1994) proposed a modification of the previous algorithm with a
proposal density of the form $g(y - a - b(x - a))$; that is,

\[ y_t = a + b(x^{(t)} - a) + z_t, \qquad z_t \sim g . \]

This autoregressive representation can be seen as intermediary between the
independent version ($b = 0$) and the random walk version ($b = 1$) of the
Metropolis-Hastings algorithm. Moreover, when $b < 0$, $X^{(t)}$ and $X^{(t+1)}$ are
negatively correlated, and this may allow for faster excursions on the surface
of $f$ if the symmetry point $a$ is well chosen. Hastings (1970) also considers
an alternative to the uniform distribution on $[x^{(t)} - \delta, x^{(t)} + \delta]$ (see Example
7.14) with the uniform distribution on $[-x^{(t)} - \delta, -x^{(t)} + \delta]$: The convergence
of the empirical average to 0 is then faster in this case, but the choice of 0 as
center of symmetry is obviously crucial and requires some a priori information
on the distribution $f$. In a general setting, $a$ and $b$ can be calibrated during
the first iterations. (See also Problem 7.23.) (See also Chen and Schmeiser
1993, 1998 for the alternative "hit-and-run" algorithm, which proceeds by
generating a random direction in the space and moves the current value by a
random distance along this direction.)
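A generic version of the autoregressive proposal can be sketched as follows. The acceptance probability used here is the general Metropolis-Hastings ratio $f(y)\,q(x|y)/\{f(x)\,q(y|x)\}$ with $q(y|x) = g(y - a - b(x - a))$; the function name `ar_metropolis` and its argument names are hypothetical, and the log-densities only need to be known up to additive constants.

```python
import math
import random

def ar_metropolis(log_f, log_g, sample_g, a, b, x0, n, seed=2):
    """Metropolis-Hastings with autoregressive proposal y = a + b (x - a) + z, z ~ g."""
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n):
        y = a + b * (x - a) + sample_g(rng)
        # log of f(y) q(x|y) / (f(x) q(y|x))
        log_ratio = (log_f(y) + log_g(x - a - b * (y - a))) \
                    - (log_f(x) + log_g(y - a - b * (x - a)))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y
        chain.append(x)
    return chain
```

Setting `b = 0` recovers the independent sampler and `b = 1` the random walk; for symmetric $g$ and `b = -1` the proposal ratio cancels, as in the reflected uniform proposal of Hastings (1970).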
7.6 Optimization and Control

The previous sections have established the theoretical validity of the
Metropolis-Hastings algorithms by showing that under suitable (and not very
restrictive) conditions on the transition kernel, the chain produced by [A.24]
is ergodic and, therefore, that the mean (7.1) converges to the expectation
$E_f[h(X)]$. In Sections 7.4 and 7.5, however, we showed that the most common
algorithms only rarely enjoy strong ergodicity properties (geometric or
uniform ergodicity). In particular, there are simple examples (see Problem 7.5)
that show how slow convergence can be.

This section addresses the problem of choosing the transition kernel q{y\x) 
and illustrates a general acceleration method for Metropolis-Hastings algo- 
rithms, which extends the conditioning techniques presented in Section 4.2. 



7.6.1 Optimizing the Acceptance Rate 

When considering only the classes of algorithms described in Section 7.4, the 
most common alternatives are to use the following: 

(a) a fully automated algorithm like ARMS ([A.28]);

(b) an instrumental density $g$ which approximates $f$, such that $f/g$ is bounded
for uniform ergodicity to apply to the algorithm [A.25];

(c) a random walk as in [A.29].

In case (a), the automated feature of [A.28] reduces "parameterization" to
the choice of initial values, which are theoretically of limited influence on the
efficiency of the algorithm. In both of the other cases, the choice of $g$ is much
more critical, as it determines the performances of the resulting
Metropolis-Hastings algorithm. As we will see below, the few pieces of advice available on
the choice of $g$ are, in fact, contrary! Depending on the type of
Metropolis-Hastings algorithm selected, one would want high acceptance rates in case (b)
and low acceptance rates in case (c).

Consider, first, the independent Metropolis-Hastings algorithm introduced
in Section 7.4. Its similarity with the Accept-Reject algorithm suggests a
choice of $g$ that maximizes the average acceptance rate

\[ \rho = E\left[ \min\left\{ \frac{f(Y)\, g(X)}{f(X)\, g(Y)},\ 1 \right\} \right]
= 2 P\left( \frac{f(Y)}{g(Y)} \ge \frac{f(X)}{g(X)} \right), \qquad X \sim f,\ Y \sim g, \]

as seen^ in Lemma 7.9. In fact, the optimization associated with the choice
of $g$ is related to the speed of convergence of $\frac{1}{T} \sum_{t=1}^{T} h(X^{(t)})$ to $E_f[h(X)]$
and, therefore, to the ability of the algorithm [A.25] to quickly explore any
complexity of $f$ (see, for example, Theorem 7.8).

^ Under the same assumption of no point mass for the ratio $f(Y)/g(Y)$.

If this optimization is to be generic (that is, independent of $h$), $g$ should
reproduce the density $f$ as faithfully as possible, which implies the
maximization of $\rho$. For example, a density $g$ that is either much less or much more
concentrated, compared with $f$, produces a ratio

\[ \frac{f(y)\, g(x)}{f(x)\, g(y)} \wedge 1 \]

having huge variations and, therefore, leads to a low acceptance rate.

The acceptance rate $\rho$ is typically impossible to compute, and one solution
is to use the minorization result $\rho \ge 1/M$ of Lemma 7.9 to minimize $M$ as in
the case of the Accept-Reject algorithm.

Alternatively, we can consider a more empirical approach that consists of
choosing a parameterized instrumental distribution $g(\cdot|\theta)$ and adjusting the
corresponding parameters $\theta$ based on the evaluated acceptance rate, now $\rho(\theta)$;
that is, first choose an initial value for the parameters, $\theta_0$, and estimate the
corresponding acceptance rate, $\hat\rho(\theta_0)$, based on $m$ iterations of [A.25], then
modify $\theta_0$ to obtain an increase in $\hat\rho$.

In the simplest cases, $\theta_0$ will reduce to a scale parameter which is increased
or decreased depending on the behavior of $\hat\rho(\theta)$. In multidimensional settings,
$\theta_0$ can also include a position parameter or a matrix acting as a scale
parameter, which makes optimizing $\rho(\theta)$ a more complex task. Note that $\hat\rho(\theta)$ can
be obtained by simply counting acceptances or through

\[ \frac{2}{m} \sum_{i=1}^{m} \mathbb{I}_{\{ f(y_i)\, g(x_i|\theta) > f(x_i)\, g(y_i|\theta) \}}, \]

where $x_1, \ldots, x_m$ is a sample from $f$, obtained, for instance, from a first
MCMC algorithm, and $y_1, \ldots, y_m$ is an iid sample from $g(\cdot|\theta)$. Therefore, if $\theta$ is
composed of location and scale parameters, a sample $((x_1, y_1), \ldots, (x_m, y_m))$
corresponding to a value $\theta_0$ can be used repeatedly to evaluate different values
of $\theta$ by a deterministic modification of $y_i$, which facilitates the maximization
of $\rho(\theta)$.
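The indicator-based estimate of $\rho(\theta)$ is a one-liner once samples from $f$ and from $g(\cdot|\theta)$ are available. The sketch below is an illustration only: the function name is hypothetical, and `f` and `g` may be unnormalized densities because the normalizing constants cancel in the comparison.

```python
def acceptance_rate_estimate(f, g, xs, ys):
    """Estimate rho as (2/m) * #{i : f(y_i) g(x_i) > f(x_i) g(y_i)}."""
    m = len(xs)
    count = sum(1 for x, y in zip(xs, ys) if f(y) * g(x) > f(x) * g(y))
    return 2.0 * count / m
```

For a location-scale family, the same pairs can be reused for another parameter value by transforming each $y_i$ deterministically and calling the function again with the matching `g`, as noted in the text.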

Example 7.18. Inverse Gaussian distribution. The inverse Gaussian
distribution has the density

\[ f(z|\theta_1,\theta_2) \propto z^{-3/2} \exp\left\{ -\theta_1 z - \frac{\theta_2}{z} + 2\sqrt{\theta_1\theta_2} + \log\sqrt{2\theta_2} \right\} \qquad (7.19) \]

on $\mathbb{R}_+$ ($\theta_1 > 0$, $\theta_2 > 0$). Denoting $\psi(\theta_1,\theta_2) = 2\sqrt{\theta_1\theta_2} + \log\sqrt{2\theta_2}$, it follows
from a classical result on exponential families (see Brown 1986, Chapter 2,
Robert 2001, Lemma 3.3.7, or Problem 1.38) that

\[ E[(Z, 1/Z)] = \nabla\psi(\theta_1,\theta_2) = \left( \sqrt{\frac{\theta_2}{\theta_1}},\ \sqrt{\frac{\theta_1}{\theta_2}} + \frac{1}{2\theta_2} \right). \]

β          0.2     0.5     0.8     0.9     1       1.1     1.2     1.5
ρ̂(β)       0.22    0.41    0.54    0.56    0.60    0.63    0.64    0.71
E[Z]       1.137   1.158   1.164   1.154   1.133   1.148   1.181   1.148
E[1/Z]     1.116   1.108   1.116   1.115   1.120   1.126   1.095   1.115

Table 7.3. Estimation of the means of $Z$ and of $1/Z$ for the inverse Gaussian
distribution $\mathcal{IN}(\theta_1,\theta_2)$ by the Metropolis-Hastings algorithm [A.25] and evaluation
of the acceptance rate for the instrumental distribution $\mathcal{G}a(\sqrt{\theta_2/\theta_1}\,\beta, \beta)$ ($\theta_1 = 1.5$,
$\theta_2 = 2$, and $m = 5000$).

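A possible choice for the simulation of (7.19) is the Gamma distribution
$\mathcal{G}a(\alpha,\beta)$ in algorithm [A.25], taking $\alpha = \beta\sqrt{\theta_2/\theta_1}$ so that the means of both
distributions coincide. Since

\[ \frac{f(x)}{g(x)} \propto x^{-\alpha-1/2} \exp\left\{ (\beta - \theta_1)x - \frac{\theta_2}{x} \right\}, \]

the ratio $f/g$ is bounded for $\beta < \theta_1$. The value of $x$ which maximizes the ratio
is the solution of

\[ (\beta - \theta_1)x^2 - (\alpha + \tfrac{1}{2})x + \theta_2 = 0; \]

that is,

\[ x_\beta^* = \frac{(\alpha + 1/2) - \sqrt{(\alpha + 1/2)^2 + 4\theta_2(\theta_1 - \beta)}}{2(\beta - \theta_1)} . \]

The analytical optimization (in $\beta$) of

\[ M(\beta) = (x_\beta^*)^{-\alpha-1/2} \exp\left\{ (\beta - \theta_1)x_\beta^* - \frac{\theta_2}{x_\beta^*} \right\} \]

is not possible, although, in this specific case the curve $M(\beta)$ can be plotted
for given values of $\theta_1$ and $\theta_2$ and the optimal value $\beta^*$ can be approximated
numerically. Typically, the influence of the choice of $\beta$ must be assessed
empirically; that is, by approximating the acceptance rate $\rho$ via the method
described above.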

Note that a new sample $(y_1, \ldots, y_m)$ must be simulated for every new
value of $\beta$. Whereas $y \sim \mathcal{G}a(\alpha,\beta)$ is equivalent to $\beta y \sim \mathcal{G}a(\alpha, 1)$, the factor $\alpha$
depends on $\beta$ and it is not possible to use the same sample for several values
of $\beta$. Table 7.3 provides an evaluation of the rate $\rho$ as a function of $\beta$ and gives
estimates of the means of $Z$ and $1/Z$ for $\theta_1 = 1.5$ and $\theta_2 = 2$. The constraint
on the ratio $f/g$ then imposes $\beta < 1.5$. The corresponding theoretical values
are respectively 1.155 and 1.116, and the optimal value of $\beta$ is $\beta^* = 1.5$. ||
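One column of Table 7.3 can be reproduced with a short independent Metropolis-Hastings run. The sketch below assumes the $\mathcal{G}a(\alpha,\beta)$ proposal with $\alpha = \beta\sqrt{\theta_2/\theta_1}$ (rate parameterization, so the scale passed to `random.gammavariate` is $1/\beta$) and uses the ratio $f/g \propto x^{-\alpha-1/2}\exp\{(\beta-\theta_1)x - \theta_2/x\}$; the function name, starting value, and seed are hypothetical choices for this illustration.

```python
import math
import random

def inverse_gaussian_mh(theta1=1.5, theta2=2.0, beta=1.2, m=5000, seed=3):
    """Independent MH for the inverse Gaussian target with a Gamma proposal."""
    rng = random.Random(seed)
    alpha = beta * math.sqrt(theta2 / theta1)

    def log_w(z):
        # log of f(z)/g(z), up to an additive constant
        return -(alpha + 0.5) * math.log(z) + (beta - theta1) * z - theta2 / z

    x = math.sqrt(theta2 / theta1)     # start at the target mean
    accepted = 0
    sum_z = 0.0
    sum_inv = 0.0
    for _ in range(m):
        y = rng.gammavariate(alpha, 1.0 / beta)
        if rng.random() < math.exp(min(0.0, log_w(y) - log_w(x))):
            x = y
            accepted += 1
        sum_z += x
        sum_inv += 1.0 / x
    return accepted / m, sum_z / m, sum_inv / m
```

Repeating the call for several values of `beta` below $\theta_1$ gives an empirical version of the curve $\hat\rho(\beta)$; each call draws a fresh proposal sample, as required by the remark above.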



The random walk version of the Metropolis-Hastings algorithm, introduced
in Section 7.5, requires a different approach to acceptance rates, given the
dependence of the instrumental distribution on the current state of the chain.
In fact, a high acceptance rate does not necessarily indicate that the algorithm
is moving correctly since it may indicate that the random walk is moving too
slowly on the surface of $f$. If $x^{(t)}$ and $y_t$ are close, in the sense that $f(x^{(t)})$ and
$f(y_t)$ are approximately equal, the algorithm [A.29] leads to the acceptance
of $y_t$ with probability

\[ \min\left( \frac{f(y_t)}{f(x^{(t)})},\ 1 \right) \simeq 1 . \]



A higher acceptance rate may therefore correspond to a slower convergence
as the moves on the support of $f$ are more limited. In the particular case
of multimodal densities whose modes are separated by zones of extremely
small probability, the negative effect of limited moves on the surface of $f$
clearly shows. While the acceptance rate is quite high for a distribution $g$
with small variance, the probability of jumping from one mode to another
may be arbitrarily small. This phenomenon occurs, for instance, in the case of
mixtures of distributions (see Section 9.7.1) and in overparameterized models
(see, e.g., Tanner and Wong 1987 and Besag et al. 1995). In contrast, if the
average acceptance rate is low, the successive values of $f(y_t)$ tend to be small
compared with $f(x^{(t)})$, which means that the random walk moves quickly on
the surface of $f$ since it often reaches the "borders" of the support of $f$ (or, at
least, that the random walk explores regions with low probability under $f$).

The above analysis seems to require an advanced knowledge of the den- 
sity of interest, since an instrumental distribution g with too narrow a range 
will slow down the convergence rate of the algorithm. On the other hand, a 
distribution g with a wide range results in a waste of simulations of points 
outside the range of / without improving the probability of visiting all of the 
modes of /. It is unfortunate that an automated parameterization of g cannot 
guarantee uniformly optimal performances for the algorithm [A.29], and that
the rules for choosing the rate presented in Note 7.8.4 are only heuristic. 
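The tension between acceptance rate and exploration is easy to see on a bimodal target. The sketch below runs a normal random walk on an equal mixture of $\mathcal{N}(-5,1)$ and $\mathcal{N}(5,1)$; the target, the starting point at the left mode, the two scales compared in the usage note, and all names are assumptions made for this illustration.

```python
import math
import random

def log_mixture(x, mu=5.0):
    """Log-density (up to a constant) of 0.5 N(-mu, 1) + 0.5 N(mu, 1), computed stably."""
    a = -0.5 * (x + mu) ** 2
    b = -0.5 * (x - mu) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def rw_mixture(scale, n=10000, seed=4):
    """Return (acceptance rate, fraction of iterations spent in the right-hand mode)."""
    rng = random.Random(seed)
    x = -5.0
    accepted = 0
    right = 0
    for _ in range(n):
        y = x + rng.gauss(0.0, scale)
        if rng.random() < math.exp(min(0.0, log_mixture(y) - log_mixture(x))):
            x = y
            accepted += 1
        if x > 0:
            right += 1
    return accepted / n, right / n
```

Comparing, say, `rw_mixture(0.2)` with `rw_mixture(5.0)` typically shows a much higher acceptance rate for the small scale together with a far less balanced occupation of the two modes, which is the phenomenon described in this section.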



7.6.2 Conditioning and Accelerations 

Similar^8 to the Accept-Reject method, the Metropolis-Hastings algorithm
does not take advantage of the total set of random variables that are
generated. Lemma 7.9 shows that the "rate of waste" of these variables $y_t$ is lower
than for the Accept-Reject method, but it still seems inefficient to ignore the
rejected $y_t$'s. As the rejection mechanism relies on an independent uniform
random variable, it is reasonable to expect that the rejected variables bring,
although indirectly, some relevant information on the distribution $f$. As in
the conditioning method introduced in Section 4.2, the Rao-Blackwellization
technique applies in the case of the Metropolis-Hastings algorithm. (Other
approaches to Metropolis-Hastings acceleration can be found in Green and
Han 1992, Gelfand and Sahu 1994, or McKeague and Wefelmeyer 2000.)

^8 This section presents material related to nonparametric Rao-Blackwellization, as
in Section 4.2, and may be skipped on a first reading.

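First, note that a sample produced by the Metropolis-Hastings algorithm,
$x^{(1)}, \ldots, x^{(T)}$, is based on two samples, $y_1, \ldots, y_T$ and $u_1, \ldots, u_T$, with $y_t \sim
q(y|x^{(t-1)})$ and $u_t \sim \mathcal{U}_{[0,1]}$. The mean (7.1) can then be written

\[ \delta^{MH} = \frac{1}{T} \sum_{t=1}^{T} h(x^{(t)})
= \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{t} \mathbb{I}_{x^{(t)} = y_i}\, h(y_i)
= \frac{1}{T} \sum_{t=1}^{T} h(y_t) \sum_{i=t}^{T} \mathbb{I}_{x^{(i)} = y_t} \]

and the conditional expectation

\[ \delta^{RB} = \frac{1}{T} \sum_{t=1}^{T} h(y_t)\, E\left[ \sum_{i=t}^{T} \mathbb{I}_{X^{(i)} = y_t} \,\Big|\, y_1, \ldots, y_T \right]
= \frac{1}{T} \sum_{t=1}^{T} h(y_t) \left( \sum_{i=t}^{T} P(X^{(i)} = y_t \mid y_1, \ldots, y_T) \right) \]

dominates the empirical mean, $\delta^{MH}$, under quadratic loss. This is a
consequence of the Rao-Blackwell Theorem (see Lehmann and Casella 1998, Section
1.7), resulting from the fact that $\delta^{RB}$ integrates out the variation due to the
uniform sample.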
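The re-indexing of $\delta^{MH}$ over the proposed values, which is the starting point of the Rao-Blackwellization, can be verified on a simulated chain. The sketch below does so for an independent sampler started at $x^{(0)} = y_0$ (an assumption made here so that every state of the chain is one of the proposed values, with the average then taken over $T+1$ terms); it checks only the identity, not the Rao-Blackwellized weights, and the names are hypothetical.

```python
import math
import random

def mh_mean_two_ways(h, log_w, sample_g, T, seed=5):
    """Compute the MH average directly and as a weighted sum over the proposals."""
    rng = random.Random(seed)
    ys = [sample_g(rng) for _ in range(T + 1)]   # y_0, ..., y_T, with x^(0) = y_0
    current = 0                                  # index t such that the chain sits at y_t
    counts = [0] * (T + 1)                       # counts[t] = number of i with x^(i) = y_t
    counts[0] = 1
    direct = h(ys[0])
    for i in range(1, T + 1):
        # independent MH acceptance, with log_w = log(f/g) up to a constant
        if rng.random() < math.exp(min(0.0, log_w(ys[i]) - log_w(ys[current]))):
            current = i
        counts[current] += 1
        direct += h(ys[current])
    direct /= (T + 1)
    reindexed = sum(h(y) * c for y, c in zip(ys, counts)) / (T + 1)
    return direct, reindexed
```

The two returned values agree up to rounding; the Rao-Blackwellized estimator replaces each random count by its conditional expectation given the proposals.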

The practical interest of this alternative to $\delta^{MH}$ is that the
probabilities $P(X^{(i)} = y_t \mid y_1, \ldots, y_T)$ can be explicitly computed. Casella and Robert
(1996) have established the two following results, which provide the weights
for $h(y_t)$ in $\delta^{RB}$ both for the independent Metropolis-Hastings algorithm and
the general Metropolis-Hastings algorithm. In both cases, the computational
complexity of these weights is of order $O(T^2)$, which is a manageable order of
magnitude.

Consider first the case of the independent Metropolis-Hastings algorithm 
associated with the instrumental distribution $g$. For simplicity's sake, assume 
that $X^{(0)}$ is simulated according to the distribution of interest, $f$, so that the 
chain is stationary, and the mean (7.1) can be written 

$$
\delta^{MH} = \frac{1}{T+1}\sum_{t=0}^{T} h(x^{(t)}),
$$

with $x^{(0)} = y_0$. If we denote 
n       10      25      50      100 

h1      50.11   49.39   48.27   46.68 
h2      42.20   44.75   45.44   44.57 

Table 7.4. Decrease (in percentage) of squared error risk associated with $\delta^{RB}$ for 
the evaluation of $E[h_i(X)]$, evaluated over 7500 simulations for different sample sizes 
n. (Source: Casella and Robert 1996.) 




The computation of $\zeta_{ij}$ for a fixed $i$ requires $(T - i)$ multiplications since 
$\zeta_{i(j+1)} = \zeta_{ij}(1 - \rho_{i(j+1)})$; therefore, the computation of all the $\zeta_{ij}$'s requires 
$T(T+1)/2$ multiplications. The derivations of $\tau_i$ and $\varphi_i$ are of the same order 
of complexity. 

Example 7.20. Rao-Blackwellization improvement for a $\mathcal{T}_3$ simulation. 
Suppose the target distribution is $\mathcal{T}_3$ and the instrumental distribution 
is Cauchy, $\mathcal{C}(0, 1)$. The ratio $f/g$ is bounded, which ensures a geometric rate 
of convergence for the associated Metropolis-Hastings algorithm. Table 7.4 
illustrates the improvement brought by $\delta^{RB}$ for some functions of interest 
$h_1(x) = x$ and $h_2(x) = \mathbb{I}_{(1.96,+\infty)}(x)$, whose (exact) expectations $E[h_i(X)]$ 
$(i = 1, 2)$ are 0 and 0.07, respectively. Over the different sample sizes selected 
for the experiment, the improvement in mean square error brought by $\delta^{RB}$ is 
of the order 50%. || 
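
The setting of Example 7.20 can be sketched in code. The block below is a minimal illustration, not the authors' implementation: it runs one independent Metropolis-Hastings chain with a $\mathcal{T}_3$ target and Cauchy proposals, and computes both the empirical mean and a Rao-Blackwellized version. The weight formulas are an assumption based on the conditional-probability argument of this section, with $w_i = f(y_i)/g(y_i)$, $\rho_{ij} = (w_j/w_i) \wedge 1$, $\zeta_{ij} = \prod_{t=i+1}^{j}(1-\rho_{it})$, $\tau_i = \sum_{j<i} \tau_j \zeta_{j(i-1)} \rho_{ji}$ and $\varphi_i = \tau_i \sum_{j \ge i} \zeta_{ij}$. All function names are hypothetical.

```python
import numpy as np

def t3_density(x):
    """Student t density with 3 degrees of freedom, up to a constant."""
    return (1.0 + x ** 2 / 3.0) ** (-2.0)

def cauchy_density(x):
    """Standard Cauchy density, up to a constant."""
    return 1.0 / (1.0 + x ** 2)

def rao_blackwell_weights(w):
    """Weights phi_0..phi_T computed from the ratios w_i = f(y_i)/g(y_i).

    Assumed formulas (independent proposals only):
      rho[i, j]  = min(w_j / w_i, 1)   probability of moving from y_i to y_j
      zeta[i, j] = prod_{t=i+1..j} (1 - rho[i, t]), zeta[i, i] = 1
      tau[i]     = P(X_i = y_i | y_0..y_T)
      phi[i]     = tau[i] * sum_{j>=i} zeta[i, j]
    """
    n = len(w)
    rho = np.minimum(w[None, :] / w[:, None], 1.0)
    zeta = np.zeros((n, n))
    for i in range(n):
        zeta[i, i] = 1.0
        for j in range(i + 1, n):
            zeta[i, j] = zeta[i, j - 1] * (1.0 - rho[i, j])
    tau = np.zeros(n)
    tau[0] = 1.0
    for i in range(1, n):
        tau[i] = sum(tau[j] * zeta[j, i - 1] * rho[j, i] for j in range(i))
    return tau * np.array([zeta[i, i:].sum() for i in range(n)])

def compare_estimators(T=50, seed=0, h=lambda x: x):
    """Return (empirical mean, Rao-Blackwellized mean, weights) for one run."""
    rng = np.random.default_rng(seed)
    y = np.empty(T + 1)
    y[0] = rng.standard_t(3)          # stationary start: x^(0) = y_0 ~ f
    y[1:] = rng.standard_cauchy(T)    # independent proposals y_1..y_T ~ g
    u = rng.uniform(size=T + 1)       # u[t] decides acceptance of y[t]
    w = t3_density(y) / cauchy_density(y)
    x = np.empty(T + 1)
    x[0] = y[0]
    cur = 0                           # index of the proposal currently held
    for t in range(1, T + 1):
        if u[t] < min(w[t] / w[cur], 1.0):
            cur = t
        x[t] = y[cur]
    phi = rao_blackwell_weights(w)
    delta_mh = h(x).mean()
    delta_rb = (phi * h(y)).sum() / (T + 1)
    return delta_mh, delta_rb, phi
```

Only the ratios $w_j/w_i$ enter the weights, so the normalizing constants of $f$ and $g$ are not needed. Calling `compare_estimators(h=lambda x: (x > 1.96).astype(float))` gives the two estimators for the indicator function $h_2$.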
We next consider the general case, with an arbitrary instrumental distribution 
$q(y|x)$. The dependence between $Y_i$ and the set of previous variables 
$Y_j$ $(j < i)$ (since $X^{(i-1)}$ can be equal to $Y_0, Y_1, \ldots,$ or $Y_{i-1}$) complicates the 
expression of the joint distribution of $Y_i$ and $U_i$, which cannot be obtained in 
closed form for arbitrary $n$. In fact, although $(X^{(t)})$ is a Markov chain, $(Y_t)$ 
is not. 

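Let us denote 

$$
\begin{aligned}
\rho_{ij} &= \frac{f(y_j)/q(y_j|y_i)}{f(y_i)/q(y_i|y_j)} \wedge 1 && (j > i), \\
\bar\rho_{ij} &= \rho_{ij}\, q(y_{j+1}|y_j), \qquad \underline{\rho}_{ij} = (1 - \rho_{ij})\, q(y_{j+1}|y_i) && (i < j < T), \\
\zeta_{jj} &= 1, \qquad \zeta_{jt} = \prod_{l=j+1}^{t} \underline{\rho}_{jl} && (j < t < T), \\
\tau_0 &= 1, \qquad \tau_j = \sum_{t=0}^{j-1} \tau_t\, \zeta_{t(j-1)}\, \bar\rho_{tj} && (j < T), \\
\tau_T &= \sum_{t=0}^{T-1} \tau_t\, \zeta_{t(T-1)}\, \rho_{tT}, \\
\omega_T^{i} &= 1, \qquad \omega_i^{j} = \bar\rho_{ji}\, \omega_{i+1}^{i} + \underline{\rho}_{ji}\, \omega_{i+1}^{j} && (0 \le j < i < T).
\end{aligned}
$$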

Casella (1996) derives the following expression for the weights of $h(y_i)$ in $\delta^{RB}$. 
We again leave the proof as a problem (Problem 7.32). 

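Theorem 7.21. The estimator $\delta^{RB}$ satisfies 

$$
\delta^{RB} = \frac{\sum_{i=0}^{T} \varphi_i\, h(y_i)}{\sum_{i=0}^{T-1} \tau_i\, \zeta_{i(T-1)}},
$$

with $(i < T)$ 

$$
\varphi_i = \tau_i \left[\sum_{j=i}^{T-1} \zeta_{ij}\, \omega_{j+1}^{i} + \zeta_{i(T-1)}(1 - \rho_{iT})\right]
$$

and $\varphi_T = \tau_T$. 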

Although these estimators are more complex than in the independent case, 
the complexity of the weights is again of order $O(T^2)$ since the computations of 
$\rho_{ij}$, $\bar\rho_{ij}$, $\underline{\rho}_{ij}$, $\zeta_{ij}$, $\tau_i$, and $\omega_i^j$ involve $T(T+1)/2$ multiplications. Casella and Robert 
(1996) give algorithmic advice toward easier and faster implementation. 

Example 7.22. (Continuation of Example 7.20) Consider now the simulation 
of a $\mathcal{T}_3$ distribution based on a random walk with perturbations 
distributed as $\mathcal{C}(0, \sigma^2)$. The choice of $\sigma$ determines the acceptance rate for the 
Metropolis-Hastings algorithm: When $\sigma = 0.4$, it is about 0.33, and when 
$\sigma = 3.0$, it increases to 0.75. 

As explained in Section 7.6.1, the choice $\sigma = 0.4$ is undoubtedly preferable 
in terms of efficiency of the algorithm. Table 7.5 confirms this argument. 
             n         10        25        50       100 

σ = 0.4      h1      10.7       8.8       7.7       7.7 
                    (1.52)    (0.98)    (0.63)     (0.3) 
             h2      23.6      25.2      25.8      25.0 
                    (0.02)    (0.01)   (0.006)   (0.003) 

σ = 3.0      h1      0.18      0.15      0.11      0.07 
                    (2.28)    (1.77)    (1.31)    (0.87) 
             h2      0.99      0.94      0.71      1.19 
                    (0.03)    (0.02)   (0.014)   (0.008) 

Table 7.5. Improvement brought by $\delta^{RB}$ (in %) and quadratic risk of the empirical 
average (in parentheses) for different sample sizes and 50,000 simulations of the 
random walk based on $\mathcal{C}(0, \sigma^2)$. (Source: Casella and Robert 1996.) 



Indeed, the quadratic risk of the estimators (7.1) is larger for $\sigma = 3.0$. The 
gains brought by $\delta^{RB}$ are smaller, compared with the independent case. They 
amount to approximately 8% and 25% for $\sigma = 0.4$ and 0.1% and 1% for 
$\sigma = 3$. Casella and Robert (1996) consider an additional comparison with an 
importance sampling estimator based on the same sample $y_1, \ldots, y_n$. || 



7.6.3 Adaptive Schemes 

Given the range of situations where MCMC applies, it is unrealistic to hope 
for a generic MCMC sampler that would function in every possible setting. 
The more generic proposals like random walk Metropolis-Hastings algorithms 
are known to fail in large dimension and disconnected supports, because they 
take too long to explore the space of interest (Neal 2003). The reason for 
this impossibility theorem is that, in realistic problems, the complexity of 
the distribution to simulate is the very reason why MCMC is used! So it 
is difficult to ask for a prior opinion about this distribution, its support or 
the parameters of the proposal distribution used in the MCMC algorithm: 
intuition is close to void in most of these problems. 

However, the performances of off-the-shelf algorithms like the random walk 
Metropolis-Hastings scheme bring information about the distribution of in- 
terest and, thus, should be incorporated in the design of better and more 
powerful algorithms. The problem is that we usually lack the time to train 
the algorithm on these previous performances and are looking for the Holy 
Grail of automated MCMC procedures! While it is natural to think that the 
information brought by the first steps of an MCMC algorithm should be used 
in later steps, there is a severe catch: using the whole past of the “chain” 
implies that this is not a Markov chain any longer. Therefore, usual convergence 
theorems do not apply and the validity of the corresponding algorithms 
is questionable. Further, it may be that, in practice, such algorithms do 
degenerate to point masses because of a too-rapid decrease in the variation of 
their proposal. 

Example 7.23. t-distribution. Consider a $\mathcal{T}(\nu, \theta, 1)$ sample $(x_1, \ldots, x_n)$ 
with $\nu$ known. Assume in addition a flat prior $\pi(\theta) = 1$ on $\theta$ as in a noninformative 
environment. While the posterior distribution can be easily computed 
at an arbitrary value of $\theta$, direct simulation and computation from this posterior 
is impossible. In a Metropolis-Hastings framework, we could fit a normal 
proposal from the empirical mean and variance of the previous values of the 
chain, 

$$
\mu_t = \frac{1}{t}\sum_{i=1}^{t} \theta^{(i)} \quad \text{and} \quad \sigma_t^2 = \frac{1}{t}\sum_{i=1}^{t} (\theta^{(i)} - \mu_t)^2 .
$$

Notwithstanding the dependence on the past, we could then use the Metropolis-Hastings 
acceptance probability 

$$
\prod_{j=1}^{n} \left[\frac{\nu + (x_j - \xi)^2}{\nu + (x_j - \theta^{(t)})^2}\right]^{-(\nu+1)/2}
\frac{\exp\{-(\mu_t - \theta^{(t)})^2 / 2\sigma_t^2\}}{\exp\{-(\mu_t - \xi)^2 / 2\sigma_t^2\}} ,
$$

where $\xi$ is the proposed value from $\mathcal{N}(\mu_t, \sigma_t^2)$. The invalidity of this scheme 
(because of the dependence on the whole sequence of $\theta^{(i)}$'s till iteration $t$) is 
illustrated in Figure 7.6: when the range of the initial values is too small, the 
sequence of $\theta^{(i)}$'s cannot converge to the target distribution and concentrates 
on too small a support. But the problem is deeper, because even when the 
range of the simulated values is correct, the (long-term) dependence on past 
values modifies the distribution of the sequence. Figure 7.7 shows that, for 
an initial variance of 2.5, there is a bias in the histogram, even after 25,000 
iterations and stabilization of the empirical mean and variance. || 
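
The naive scheme of Example 7.23 can be sketched as follows, to make explicit what "fitting the proposal on the whole past" means. This is an illustration of the scheme the example warns against, under several assumptions that the text does not specify: the chain is started from `n_init` values drawn around the sample median with variance `init_var`, the empirical variance uses the divisor $t$, and a small numerical floor protects the variance. The function names are hypothetical.

```python
import numpy as np

def log_posterior(theta, x, nu):
    """Log posterior of the location theta under a flat prior, up to a constant."""
    return -0.5 * (nu + 1.0) * np.sum(np.log(nu + (x - theta) ** 2))

def naive_adaptive_chain(x, nu, n_iter=5000, init_var=0.1, n_init=10, seed=0):
    """Chain whose normal proposal is refitted on all past values.

    This is NOT a valid MCMC sampler: the proposal depends on the whole
    past, so the sequence is not a Markov chain.
    """
    rng = np.random.default_rng(seed)
    theta = np.empty(n_init + n_iter)
    # Assumed initialization: n_init values with variance init_var.
    theta[:n_init] = np.median(x) + np.sqrt(init_var) * rng.standard_normal(n_init)
    s1 = theta[:n_init].sum()            # running sum of past values
    s2 = (theta[:n_init] ** 2).sum()     # running sum of squares
    for t in range(n_init, n_init + n_iter):
        mu = s1 / t                              # empirical mean of the past
        var = max(s2 / t - mu ** 2, 1e-12)       # empirical variance, floored
        cur = theta[t - 1]
        prop = mu + np.sqrt(var) * rng.standard_normal()
        # Posterior ratio times proposal ratio q(cur) / q(prop), in logs.
        log_ratio = (log_posterior(prop, x, nu) - log_posterior(cur, x, nu)
                     - 0.5 * (mu - cur) ** 2 / var
                     + 0.5 * (mu - prop) ** 2 / var)
        theta[t] = prop if np.log(rng.uniform()) < log_ratio else cur
        s1 += theta[t]
        s2 += theta[t] ** 2
    return theta
```

Running the function with different values of `init_var` and comparing histograms of the output with the posterior density is one way to explore the behavior described in the example.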

Even though the Markov chain is converging in distribution to the target 
distribution (when using a proper, i.e., time-homogeneous, updating scheme), 
using past simulations to create a nonparametric approximation to the target 
distribution does not work either. For instance, Figure 7.8 shows the output 
of an adaptive scheme in the setting of Example 7.23 when the proposal 
distribution is the Gaussian kernel based on earlier simulations. A very large 
number of iterations is not sufficient to reach an acceptable approximation of 
the target distribution. 

The overall message is thus that one should not constantly adapt the pro- 
posal distribution on the past performances of the simulated chain. Either the 
adaptation must cease after a period of burn in (not to be taken into account 
for the computations of expectations and quantities related to the target dis- 
tribution), or the adaptive scheme must be theoretically assessed in its own 
right. This latter path is not easy and only a few examples can be found (so 
Fig. 7.6. Output of the adaptive scheme for the t-distribution posterior with a 
sample of 10 $x_j \sim \mathcal{T}_3$ and initial variances of (top) 0.1, (middle) 0.5, and (bottom) 
2.5. The left column plots the sequence of $\theta^{(i)}$'s while the right column compares 
its histogram against the true posterior distribution (with a different scale for the 
upper graph). 

Fig. 7.7. Comparison of the distribution of an adaptive scheme sample of 25,000 
points with initial variance of 2.5 and of the target distribution. 

Fig. 7.8. Sample produced by 50,000 iterations of a nonparametric adaptive 
MCMC scheme and comparison of its distribution with the target distribution. 



far) in the literature. See, e.g., Gilks et al. (1998), who use regeneration to create 
block independence and preserve Markovianity on the paths rather than 
on the values (see also Sahu and Zhigljavsky 1998 and Holden 1998); Haario 
et al. (1999, 2001), who derive a proper* adaptation scheme in the spirit of 
Example 7.23 by using a Gaussian proposal and a ridge-like correction to the 
empirical variance, 

$$
\Sigma_t = s\, \mathrm{Cov}(\theta^{(0)}, \ldots, \theta^{(t)}) + s\, \varepsilon\, I_d ,
$$

where $s$ and $\varepsilon$ are constant; and Andrieu and Robert (2001), who propose a 
more general framework of valid adaptivity based on stochastic optimization 
and the Robbins-Monro algorithm. (The latter actually embeds the chain of 
interest in a larger chain that also includes the parameter of the 
proposal distribution as well as the gradient of a performance criterion.) We 
will again consider adaptive algorithms in Chapter 14, with more accessible 
theoretical justifications. 
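
The covariance update $\Sigma_t = s\,\mathrm{Cov}(\theta^{(0)},\ldots,\theta^{(t)}) + s\,\varepsilon\, I_d$ can be sketched as a random walk sampler. The block below is an illustrative sketch, not the algorithm of Haario et al. as published: the scaling $s = 2.4^2/d$, the length `n_fixed` of the initial non-adaptive period, the value of `eps`, and the function name are all assumptions made for the example.

```python
import numpy as np

def adaptive_metropolis(log_target, theta0, n_iter=5000, eps=1e-6,
                        n_fixed=100, seed=0):
    """Random walk Metropolis with proposal covariance
    s * Cov(theta_0..theta_{t-1}) + s * eps * I_d after n_fixed steps."""
    rng = np.random.default_rng(seed)
    theta0 = np.asarray(theta0, dtype=float)
    d = len(theta0)
    s = 2.4 ** 2 / d                 # assumed scaling constant
    chain = np.empty((n_iter + 1, d))
    chain[0] = theta0
    cov = np.eye(d)                  # fixed covariance during the first steps
    logp = log_target(chain[0])
    for t in range(1, n_iter + 1):
        if t > n_fixed:
            emp = np.cov(chain[:t].T).reshape(d, d)   # empirical covariance
            cov = s * emp + s * eps * np.eye(d)       # ridge-like correction
        prop = rng.multivariate_normal(chain[t - 1], cov)
        logp_prop = log_target(prop)
        # The Gaussian random walk proposal is symmetric, so only the
        # target ratio enters the acceptance probability.
        if np.log(rng.uniform()) < logp_prop - logp:
            chain[t], logp = prop, logp_prop
        else:
            chain[t] = chain[t - 1]
    return chain
```

For instance, `adaptive_metropolis(lambda th: -0.5 * np.sum(th ** 2), np.zeros(2))` targets a standard bivariate normal. The term $s\,\varepsilon\, I_d$ keeps the proposal covariance positive definite even when the past values are nearly collinear.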



* Since the chain is not Markov, the authors need to derive an ergodic theorem on 
their own. 



7.7 Problems 



7.1 Calculate the mean of a Gamma(4.3, 6.2) random variable using 

(a) Accept-Reject with a Gamma(4, 7) candidate. 

(b) Metropolis-Hastings with a Gamma(4, 7) candidate. 

(c) Metropolis-Hastings with a Gamma(5, 6) candidate. 

In each case monitor the convergence. 

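7.2 Student's $\mathcal{T}_\nu$ density with $\nu$ degrees of freedom is given by 

$$
f(x|\nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \frac{1}{\sqrt{\nu\pi}} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2} .
$$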

Calculate the mean of a t distribution with 4 degrees of freedom using a 
Metropolis-Hastings algorithm with candidate density 

(a) N(0,1) 

(b) t with 2 degrees of freedom. 

Monitor the convergence of each. 

7.3 Complete some details of Theorem 7.2: 

(a) To establish (7.2), show that 

$$
\begin{aligned}
K(x, A) &= P(X_{t+1} \in A \,|\, X_t = x) \\
&= P(Y \in A \text{ and } X_{t+1} = Y \,|\, X_t = x) + P(x \in A \text{ and } X_{t+1} = x \,|\, X_t = x) \\
&= \int_A q(y|x)\, \rho(x,y)\, dy + \int_{\mathcal{Y}} \mathbb{I}(x \in A)\,(1 - \rho(x,y))\, q(y|x)\, dy ,
\end{aligned}
$$

where $q(y|x)$ is the instrumental density and $\rho(x,y) = P(X_{t+1} = y \,|\, X_t = x)$. 
Take the limiting case $A = \{y\}$ to establish (7.2). 

(b) Establish (7.3). Notice that $\delta_y(x) f(y) = \delta_x(y) f(x)$. 

7.4 For the transition kernel, 






give sufficient conditions on $\rho$ and $\tau$ for the stationary distribution $\pi$ to exist. 
Show that, in this case, $\pi$ is a normal distribution and that (7.4) occurs. 

7.5 (Doukhan et al. 1994) The algorithm presented in this problem is used in Chap- 
ter 12 as a benchmark for slow convergence. 

(a) Prove the following result: 



Lemma 7.24. Consider a probability density $g$ on $[0, 1]$ and a function $0 < 
\rho < 1$ such that 

$$
\int_0^1 \frac{g(x)}{1 - \rho(x)}\, dx < \infty .
$$

The Markov chain with transition kernel 

$$
K(x, x') = \rho(x)\, \delta_x(x') + (1 - \rho(x))\, g(x') ,
$$

where $\delta_x$ is the Dirac mass at $x$, has stationary distribution 

$$
f(x) \propto g(x) / (1 - \rho(x)) .
$$

(b) Show that an algorithm for generating the Markov chain associated with 
Lemma 7.24 is given by 

Algorithm A.30: Repeat or Simulate 

1. Take $X^{(t+1)} = x^{(t)}$ with probability $\rho(x^{(t)})$. 

2. Else, generate $X^{(t+1)} \sim g(y)$.   [A.30] 
(c) Highlight the similarity with the Accept-Reject algorithm and discuss in 
which sense they are complementary. 

7.6 (Continuation of Problem 7.5) Implement the algorithm of Problem 7.5 when 
$g$ is the density of the $\mathcal{B}e(\alpha + 1, 1)$ distribution and $\rho(x) = 1 - x$. Give the 
expression of the stationary distribution $f$. Study the acceptance rate as $\alpha$ varies 
around 1. (Note: Doukhan et al. 1994 use this example to derive $\beta$-mixing chains 
which do not satisfy the Central Limit Theorem.) 
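
As a starting point for Problem 7.6, the two steps of the "repeat or simulate" algorithm of Problem 7.5 can be sketched as below, with $g$ the $\mathcal{B}e(\alpha+1, 1)$ density and $\rho(x) = 1 - x$. This is a minimal sketch under these assumptions; the function name and the way the rate of new simulations is reported are illustrative choices.

```python
import numpy as np

def repeat_or_simulate(alpha, n_iter=10000, seed=0):
    """Chain that repeats the current value with probability rho(x) = 1 - x
    and otherwise draws a fresh value from Be(alpha + 1, 1)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_iter)
    cur = rng.beta(alpha + 1.0, 1.0)
    n_new = 0
    for t in range(n_iter):
        # With probability 1 - cur keep the current value (step 1);
        # otherwise simulate a new value from g (step 2).
        if rng.uniform() >= 1.0 - cur:
            cur = rng.beta(alpha + 1.0, 1.0)
            n_new += 1
        x[t] = cur
    return x, n_new / n_iter
```

The second returned value is the proportion of iterations in which a new value was simulated, which can be tracked as $\alpha$ varies. Values of the chain close to 0 are repeated for a long time, which is the source of the slow convergence mentioned in the problem.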

7.7 (Continuation of Problem 7.5) Compare the algorithm [A.30] with the corresponding 
Metropolis-Hastings algorithm; that is, the algorithm [A.25] associated 
with the same pair $(f, g)$. (Hint: Take into account the fact that [A.30] 
simulates only the $y_t$'s which are not discarded and compare the computing 
times when a recycling version as in Section 7.6.2 is implemented.) 

7.8 Determine the distribution of $Y_t$ given $y_{t-1}, \ldots$ in [A.25]. 

7.9 (Tierney 1994) Consider a version of [A.25] based on a “bound” $M$ on $f/g$ that 
is not a uniform bound; that is, $f(x)/g(x) > M$ for some $x$. 

(a) If an Accept-Reject algorithm uses the density $g$ with acceptance probability 
$f(y)/Mg(y)$, show that the resulting variables are generated from 

$$
\tilde f(x) \propto \min\{f(x), M g(x)\} ,
$$

instead of $f$. 

(b) Show that this error can be corrected, for instance by using the Metropolis-Hastings 
algorithm: 

1. Generate $Y_t \sim \tilde f$. 

2. Accept with probability 

$$
P(X^{(t+1)} = y_t \,|\, x^{(t)}, y_t) =
\begin{cases}
\min\left\{1, \dfrac{f(y_t)\, g(x^{(t)})}{g(y_t)\, f(x^{(t)})}\right\} & \text{if } \dfrac{f(y_t)}{g(y_t)} > M , \\[2ex]
\min\left\{1, \dfrac{M g(x^{(t)})}{f(x^{(t)})}\right\} & \text{otherwise.}
\end{cases}
$$

to produce a sample from $f$. 



7.10 The inequality (7.8) can also be established using Orey's inequality (see Problem 
6.42 for a slightly different formulation). For two transitions $P$ and $Q$, 

$$
\|P^n - Q^n\|_{TV} \le 2 P(X_n \ne Y_n), \qquad X_n \sim P^n, \quad Y_n \sim Q^n .
$$

Deduce that when $P$ is associated with the stationary distribution $f$ and when 
$X_n$ is generated by [A.25], under the condition (7.7), 

$$
\|P^n - f\|_{TV} \le 2 \left(1 - \frac{1}{M}\right)^n .
$$

Hint: Use a coupling argument based on 



A 



n 




9{z) - 
1 - 1/M 



with probability 1/M 
otherwise. 



7.11 Complete the proof of Theorem 7.8: 
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(a) Verify (7.10) and prove (7.11). {Hint: By (7.9), the inner integral is immedi- 
ately bounded by 1 — Then repeat the argument for the outer integral.) 

(b) Verify (7.12) and prove (7.8). 

7.12 In the setup of Hastings' (1970) uniform-normal example (see Example 7.14): 

(a) Study the convergence rate (represented by the 90% interquantile range) 
and the acceptance rate when $\delta$ increases. 

(b) Determine the value of $\delta$ which minimizes the variance of the empirical 
average. (Hint: Use a simulation experiment.) 

7.13 Show that, for an arbitrary Metropolis-Hastings algorithm, every compact set 
is a small set when / and q are positive and continuous everywhere. 

7.14 (Mengersen and Tweedie 1996) With respect to Theorem 7.15, define 

$$
A_x = \{y;\ f(x) \le f(y)\} \quad \text{and} \quad B_x = \{y;\ f(x) \ge f(y)\} .
$$

(a) If $f$ is symmetric, show that $A_x = \{|y| < |x|\}$ for $|x|$ larger than a value $x_0$. 

(b) Define xi as the value after which / is log-concave and x* = xq\/ xi. For 
V{x) = exp5|x| and s < a, show that 

E[V(Xi)|a;o=a:] ^ ^ ^ 



V{x) 



g(x-y)dy 
+ C" - ll g{x - y)dy 

J X 

noo 

+ 2 g{y)dy . 

J X 



(c) Show that 



£ - 1 + - e-“«) g(y)dy 

= -£[l-e~^y] [l-e-^--^^^]g{y)dy 



and deduce that (7.17) holds for x > x* and x* large enough, 

(d) For X < X*, show that 



E[V{Xi)\xo = x] 
V{x) 



< 1+2 



poo px* 

/ g{y)dy + / g{z)dz 

J X* Jo 



and thus establish the theorem. 

7.15 Examine whether the following distributions are log-concave in the tails: Nor- 
mal, log-normal. Gamma, Student’s t, Pareto, Weibull. 

7.16 The following theorem is due to Mengersen and Tweedie (1996). 

Theorem 7.25. If the support of f is not compact and if g is symmetric, the 
chain (X^^^) produced by [A. 29] is not uniformly ergodic. 

Assume that the chain satisfies Doeblin’s condition (Theorem 6.59). 

(a) Take xq and Aq =] — oo, xq] such that u{Ao) > £ and consider the unilateral 
version of the random walk, with kernel 

K-{x,A) = Iia{x)+ [ 

^ J An] — oo,x] 



g{x -y)dy; 
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that is, the random walk which only goes to the left. Show that for y > xo^ 

P"^{y,Ao) < Py{r <m)< Py(r~ < m), 

where r and r~ are the return times to Ao for the chain and for K~ , 

respectively, 

(b) For y sufficiently large to satisfy 

{K-r{y,Ao) = - oo,:ro -j/]) < - , 

m 

show that 

m 

Py{T- <m)<Y^ {K-y(y,Ao} < m{K-r{y,Ao) , 

contradicting Doeblin’s condition and proving the theorem. 

(c) Formulate a version of Theorem 7.25 for higher-dimensional chains. 

7.17 Mengersen and Tweedie (1996) also establish the following theorem: 

Theorem 7.26. If $g$ is continuous and satisfies 

$$
\int |x|\, g(x)\, dx < \infty ,
$$

the chain of [A.29] is geometrically ergodic if and only if 

$$
\bar\varrho = \lim_{x \to \infty} \frac{d}{dx} \log f(x) < 0 .
$$

(a) To establish sufficiency, show that for $x < y$ large enough, we have 

$$
\log f(y) - \log f(x) = \int_x^y \frac{d}{dt} \log f(t)\, dt \le \frac{\bar\varrho}{2}\, (y - x) .
$$

Deduce that this inequality ensures the log-concavity of $f$ and, therefore, 
the application of Theorem 7.15. 

(b) For necessity, suppose ^ = 0. For every (5 > 0, show that you can choose x 
large enough so that \ogf{x + z) - log/(x) > -6z, z > 0 and, therefore, 
f{x A z) exp(^2:) > f{x). By integrating out the 2:’s, show that 

poo 

/ f(y) dy = oo, 

J X 

contradicting condition (7.18). 

7.18 For the situation of Example 7.16, show that 

$$
\lim_{x \to \infty} \frac{d}{dx} \log \varphi(x) = -\infty \quad \text{and} \quad \lim_{x \to \infty} \frac{d}{dx} \log \psi(x) = 0 ,
$$

showing that the chain associated with $\varphi$ is geometrically ergodic and the chain 
associated with $\psi$ is not. 

7.19 Verify that the transition matrix associated with the geometric random walk 

in Example 7.17 is correct and that j3 — minimizes A/3. 
7.20 The Institute for Child Health Policy at the University of Florida studies the 
effects of health policy decisions on children's health. A small portion of one of 
their studies follows. 

The overall health of a child (metq) is rated on a 1-3 scale, with 3 being the 
worst. Each child is in an HMO* (variable np, 1=nonprofit, -1=for profit). The 
dependent variable of interest ($y_{ij}$) is the use of an emergency room (erodds, 
1=used emergency room, 0=did not). The question of interest is whether the 
status of the HMO affects the emergency room choice. 

(a) An appropriate model is the logistic regression model, 

$$
\mathrm{logit}(p_{ij}) = a + b x_i + c z_{ij}, \qquad i = 1, \ldots, k, \quad j = 1, \ldots, n_i ,
$$

where $x_i$ is the HMO type, $z_{ij}$ is the health status of the child, and $p_{ij}$ 
is the probability of using an emergency room. Verify that the likelihood 
function is 

$$
\prod_{i=1}^{k} \prod_{j=1}^{n_i} \left(\frac{\exp(a + b x_i + c z_{ij})}{1 + \exp(a + b x_i + c z_{ij})}\right)^{y_{ij}} \left(\frac{1}{1 + \exp(a + b x_i + c z_{ij})}\right)^{1 - y_{ij}} .
$$

(Here we are only distinguishing between for-profit and non-profit, so $k = 2$.) 

(b) Run a standard GLM on these data** and get the estimated mean and 
variance of $a$, $b$, and $c$. 

(c) Use normal candidate densities with mean and variance at the GLM estimates 
in a Metropolis-Hastings algorithm that samples from the likelihood. 
Get histograms of the parameter values. 



x     4       7       8     9     10          11       12 
y     2,10    4,22    16    10    18,26,34    17,28    14,20,24,28 

x     13             14             15          16       17          18             19 
y     26,34,34,46    26,36,60,80    20,26,54    32,40    32,40,50    42,56,76,84    36,46,68 

x     20                22    23    24              25 
y     32,48,52,56,64    66    54    70,92,93,120    85 

Table 7.6. Braking distances of 50 cars, x = speed (mph), y = distance to stop 
(feet). 



* A person who joins an HMO (for Health Maintenance Organization) obtains their 
medical care through physicians belonging to the HMO. 

** Available as LogisticData.txt on the book website. 

7.21 The famous “braking data” of Tukey (1977) is given in Table 7.6. It is thought 
that a good model for this dataset is a quadratic model 

$$
y_{ij} = a + b x_i + c x_i^2 + \varepsilon_{ij}, \qquad i = 1, \ldots, k, \quad j = 1, \ldots, n_i .
$$

If we assume that $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$, independent, then the likelihood function is 

$$
\left(\frac{1}{\sigma^2}\right)^{N/2} \exp\left\{-\frac{1}{2\sigma^2} \sum_{ij} (y_{ij} - a - b x_i - c x_i^2)^2\right\} ,
$$

where $N = \sum_i n_i$. We can view this likelihood function as a posterior distribution 
of $a$, $b$, $c$, and $\sigma^2$, and we can sample from it with a Metropolis-Hastings 
algorithm. 

(a) Get estimates of $a$, $b$, $c$, and $\sigma^2$ from a usual linear regression. 

(b) Use the estimates to select a candidate distribution. Take normals for $a$, $b$, $c$ 
and inverted gamma for $\sigma^2$. 

(c) Make histograms of the posterior distributions of the parameters. Monitor 
convergence. 

(d) Robustness considerations could lead to using an error distribution with 
heavier tails. If we assume that $\varepsilon_{ij} \sim t(0, \sigma^2)$, independent, then the likelihood 
function is 

$$
\left(\frac{1}{\sigma^2}\right)^{N/2} \prod_{ij} \left(1 + \frac{(y_{ij} - a - b x_i - c x_i^2)^2}{\nu \sigma^2}\right)^{-(\nu+1)/2} ,
$$

where $\nu$ is the degrees of freedom. For $\nu = 4$, use Metropolis-Hastings to 
sample $a$, $b$, $c$, and $\sigma^2$ from this posterior distribution. Use either normal or 
t candidates for $a$, $b$, $c$, and either inverted gamma or half-t for $\sigma^2$. 

{Note: See Problem 11.4 for another analysis of this dataset.) 
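
As a possible starting point for part (a) of Problem 7.21, the sketch below enters the braking data as read from Table 7.6, fits the quadratic model by least squares, and defines the Gaussian log-likelihood that a Metropolis-Hastings sampler could target. The variable and function names are hypothetical, and the unbiased divisor $N - 3$ for the residual variance is a choice made for the example.

```python
import numpy as np

# Braking data as read from Table 7.6: speed (mph) -> stopping distances (feet).
braking = {
    4: [2, 10], 7: [4, 22], 8: [16], 9: [10], 10: [18, 26, 34],
    11: [17, 28], 12: [14, 20, 24, 28], 13: [26, 34, 34, 46],
    14: [26, 36, 60, 80], 15: [20, 26, 54], 16: [32, 40],
    17: [32, 40, 50], 18: [42, 56, 76, 84], 19: [36, 46, 68],
    20: [32, 48, 52, 56, 64], 22: [66], 23: [54],
    24: [70, 92, 93, 120], 25: [85],
}
speed = np.array([s for s, ds in braking.items() for _ in ds], dtype=float)
dist = np.array([d for ds in braking.values() for d in ds], dtype=float)

def log_likelihood(a, b, c, sigma2):
    """Gaussian log-likelihood of the quadratic model, up to a constant."""
    resid = dist - a - b * speed - c * speed ** 2
    return -0.5 * len(dist) * np.log(sigma2) - 0.5 * np.sum(resid ** 2) / sigma2

# Least squares fit; np.polyfit returns the highest-degree coefficient first.
c_hat, b_hat, a_hat = np.polyfit(speed, dist, deg=2)
resid_hat = dist - a_hat - b_hat * speed - c_hat * speed ** 2
sigma2_hat = np.sum(resid_hat ** 2) / (len(dist) - 3)
```

The estimates `a_hat`, `b_hat`, `c_hat`, and `sigma2_hat` can then be used to center the candidate distributions of part (b), with `log_likelihood` playing the role of the log target.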

7.22 The traveling salesman problem is a classic in combinatorics and operations 
research, where a salesman has to find the shortest route to visit each of his $N$ 
customers. 

(a) Show that the problem can be described by (i) a permutation $\sigma$ on $\{1, \ldots, 
N\}$ and (ii) a distance $d(i,j)$ on $\{1, \ldots, N\}$. 

(b) Deduce that the traveling salesman problem is equivalent to minimization 
of the function 



(c) Propose a Metropolis-Hastings algorithm to solve the problem with a sim- 
ulated annealing scheme (Section 5.2.3). 

(d) Derive a simulation approach to the solution of Ax = b and discuss its 
merits. 

7.23 Check whether a negative coefficient $b$ in the random walk $Y_t = a + b(X^{(t)} - 
a) + Z_t$ induces a negative correlation between the $X^{(t)}$'s. Extend to the case 
where the random walk has an ARCH-like structure, 

$$
Y_t = a + b(X^{(t)} - a) + \exp(c + d(X^{(t)} - a)^2)\, Z_t .
$$

7.24 Implement the Metropolis-Hastings algorithm when $f$ is the normal $\mathcal{N}(0, 1)$ 
density and $q(\cdot|x)$ is the uniform $\mathcal{U}[-x - \delta, -x + \delta]$ density. Check for negative 
correlation between the $X^{(t)}$'s when $\delta$ varies. 
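
One possible way to set up the experiment of Problem 7.24 is sketched below. It relies on the observation that the proposal is symmetric, since $q(y|x) = q(x|y) = 1/(2\delta)$ whenever $|x + y| \le \delta$, so the acceptance probability reduces to $\min\{1, f(y)/f(x)\}$. The function names and the starting value 0 are assumptions made for the illustration.

```python
import numpy as np

def reflected_uniform_mh(delta, n_iter=20000, seed=0):
    """Metropolis-Hastings for N(0,1) with proposals uniform on
    [-x - delta, -x + delta], i.e. centered at the reflection of x."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_iter)
    cur = 0.0
    for t in range(n_iter):
        prop = rng.uniform(-cur - delta, -cur + delta)
        # Symmetric proposal: accept with probability min(1, f(prop)/f(cur)).
        if np.log(rng.uniform()) < 0.5 * (cur ** 2 - prop ** 2):
            cur = prop
        x[t] = cur
    return x

def lag1_autocorrelation(x):
    """Empirical lag-1 autocorrelation of a sequence."""
    xc = x - x.mean()
    return float(np.dot(xc[:-1], xc[1:]) / np.dot(xc, xc))
```

Calling `lag1_autocorrelation(reflected_uniform_mh(delta))` for a grid of values of `delta` gives a first picture of how the correlation between successive values depends on $\delta$.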

7.25 Referring to Example 7.11 

(a) Verify that $\log \alpha$ has an exponential distribution. 

(b) Show that the posterior distribution is proper, that is 

$$
\int L(\alpha, \beta | y)\, \pi(\alpha, \beta)\, d\alpha\, d\beta < \infty .
$$

(c) Show that $E\alpha = \log b - \gamma$, where $\gamma$ is Euler's constant. 

7.26 Referring to the situation of Example 7.12: 

(a) Use Taylor series to establish the approximations 

$$
K_X(t) \approx K_X(0) + K_X'(0)\, t + K_X''(0)\, t^2/2
$$

$$
t K_X'(t) \approx t\, [K_X'(0) + K_X''(0)\, t]
$$

and hence (7.14). 

(b) Write out the Metropolis-Hastings algorithm that will produce random vari- 
ables from the saddlepoint distribution. 

(c) Apply the Metropolis saddlepoint approximation to the noncentral chi 
squared distribution and reproduce the tail probabilities in Table 7.1. 

7.27 Given a Cauchy $\mathcal{C}(0, \sigma)$ instrumental distribution: 

(a) Experimentally select $\sigma$ to (i) maximize the acceptance rate when simulating 
a $\mathcal{N}(0, 1)$ distribution and (ii) minimize the squared error when estimating the mean 
(equal to 0). 

(b) Same as (a), but when the instrumental distribution is 

7.28 Show that the Rao-Blackwellized estimator $\delta^{RB}$ does not depend on the 
normalizing factors in $f$ and $g$. 

7.29 Reproduce the experiment of Example 7.20 in the case of a Student’s 7? dis- 
tribution. 

7.30 In the setup of the Metropolis-Hastings algorithm [A. 24], the Yts are gener- 
ated from the distributions q{y\x^^^). Assume that Ti = ~ /. 

(a) Show that the estimator 






1 ^ 



fivt) 

^ q{yt\xW) 



Kvt) 



is an unbiased estimator of Ef[h{X)]. 

(b) Derive, from the developments of Section 7.6.2, that the Rao-Blackwellized 
version of ^0 is 






h{x\) + 



/(j/2) 



h{y2) 



q{y2\xi) 



- 2)(1 






Kyi) 



(c) Compare (5i with the Rao-Blackwellized estimator of Theorem 7.21 in the 
case of a 7a distribution for the estimation of h{x) = Ix>2- 

7.31 Prove Theorem 7.19 as follows: 

(a) Use the properties of the Markov chain to show that the conditional probability 
$\tau_i$ can be written as 

$$
\tau_i = \sum_{j=0}^{i-1} P(X_i = y_i \,|\, X_{i-1} = y_j)\, P(X_{i-1} = y_j) = \sum_{j=0}^{i-1} \rho_{ji}\, P(X_{i-1} = y_j) .
$$
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(b) Show that 

P{Xi-i = i/j) = P{Xj = Uj, Xjj^i = yj ^ . . . , = yj) 

~ (1 ~ /^j(j + l)) ■ ■ ‘ (1 “ Pj(i-l)) 



and, hence, establish the expression for the weight (^i. 

7.32 Prove Theorem 7.21 as follows: 

(a) As in the independent case, the first step is to compute P{Xj = yi\yo,yi, 

. . . ,2/n)- The event {Xj = yi} can be written as the set of all the z-tuples 

leading to {Xi = ?/i}, of all the (j — z)-tuples (ui+i, . . . , Uj) 
corresponding to the rejection of . . . ,yj) and of all the (n — j)-tuples 

Uj+i, . . . , i^n following after Xj = yi. 

Define = {u\ > poi} and B\ = {u\ < poi}, and let B\{u\^ . . . ^Ut) 
denote the event {Xt = ^fc}. Establish the relation 

k-l 

Bl{ui, . . . ,ut) = y . . . ,Wfc-i) 

m=0 

<C Pmk-) '^fe+1 ^ Pk{t-\-l)-> • • ' •)Ut ^ Pfct}] ? 

and show that 

{xj=yi}= (J . . ,Ui-i) 

k=0 

<C Pki 5 Ui^l > ) 5 • • • ? Uj > Pij }] • 

(b) Let p{ui , . . . , ut, 2 / 1 , • • . , ^t) = p(u, y) denote the joint density of the U![s 
and the Y-s. Show that n = p(u, y)dui • • • dui, where A = IJI^o ('^i ? 

. . . ,Ui-i) n {ui < Pki). 

(c) Show that ujj-\-i — p(u, y)duj+i • • - dur and, using part (b), estab- 

lish the identity 

j j 

n H ~ Pit)q(yt+i\yiWj+i = Ti p.^uj}+i = n Qj oj}+i. 

t=i+l t=z+l 

(d) Verify the relations 

i — l 
t=0 

which provide a recursion relation on the cjj’s depending on acceptance 
or rejection of yj+i. The case j = T must be dealt with separately, since 
there is no generation of yr+i based on q{y\xT)> Show that is equal to 
piT + (1 — Pit) = 1 

(e) The probability P{Xj = yi) can be deduced from part (d) by computing 

the marginal distribution of (yi,...,yT). Show that 1 = P{Xt = 

yi) = YlJ=o '^iCi{T-i)y and, hence, the normalizing constant for part (d) is 

, which leads to the expression for ip. 
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7.33 (Liu 1996b) Consider a finite state-space X = {1, .. . ,m} and a Metropolis- 
Hastings algorithm on X associated with the stationary distribution tt = (tti, 
. . . , 7Tm) and the proposal distribution p = (pi, . . . ,Pm). 

(a) For LOi = 7Ti/pi and \k = express the transition matrix of 

the Metropolis-Hastings algorithm as K = G-hep^, where e == (1, . . . , 1)^. 
Show that G is upper triangular with diagonal elements the Afc’s. 

(b) Deduce that the eigenvalues of G and K are the A^’s. 

(c) Show that for \\p\\ = X] |pi|, 

7T(xJ 

following a result by Diaconis and Hanlon (1992). 

7.34 (O Ruanaidh and Fitzgerald 1996) Given the model 

yi = -f -h c + €i, 

where ei ~ and the tj’s are known observation times, study the es- 

timation of (Ai, A 2 , Ai, A 2 , cr) by recursive integration (see Section 5.2.4) with 
particular attention to the Metropolis-Hastings implementation. 

7.35 (Roberts 1998) Take $f$ to be the density of the $\mathcal{E}xp(1)$ distribution and $g_1$ and 
$g_2$ the densities of the $\mathcal{E}xp(0.1)$ and $\mathcal{E}xp(5)$ distributions, respectively. The aim 
of this problem is to compare the performances of the independent Metropolis-Hastings 
algorithms based on the pairs $(f, g_1)$ and $(f, g_2)$. 

(a) Compare the convergences of the empirical averages for both pairs, based 
on 500 replications of the Markov chains. 

(b) Show that the pair $(f, g_1)$ leads to a geometrically ergodic Markov chain 
and $(f, g_2)$ does not. 

7.36 For the Markov chain of Example 7.28, define 

=I[0,13](^'‘^)+2I[13,oc)(^^‘^). 

(a) Show that (^^^^) is not a Markov chain. 

(b) Construct an estimator of the pseudo-transition matrix of (^^*^). 

7.37 Show that the transition associated with the acceptance probability (7.20) also 
leads to / as invariant distribution, for every symmetric function s. (Hint: Use 
the reversibility equation.) 

7.38 Show that the Metropolis-Hastings algorithm is, indeed, a special case of the 
transition associated with the acceptance probability (7.20) by providing the 
corresponding s{x,y). 

7.39 (Peskun 1973) Let Pi and P 2 be regular (see Problems 6.9 and 6.10), reversible 
stochastic matrices with the same stationary distribution tt on {1, . . . , m}. Show 
that if Pi < P 2 (meaning that the off-diagonal elements are smaller in the first 
case) for every function h, 

var /N , 

where (AT^^^) is a Markov chain with transition matrix Pi (z = 1, 2). (Hint: Use 
Kemeny and Snell 1960 result on the asymptotic variance in Problem 6.50.) 
7.40 (Continuation of Problem 7.39) Deduce from Problem 7.39 that for a given 
instrumental matrix $Q$ in a Metropolis-Hastings algorithm, the choice 

$$
p_{ij} = q_{ij} \left(\frac{\pi_j q_{ji}}{\pi_i q_{ij}} \wedge 1\right)
$$

is optimal among the transitions such that 

$$
p_{ij} = \frac{q_{ij}\, s_{ij}}{1 + \dfrac{\pi_i q_{ij}}{\pi_j q_{ji}}} = q_{ij}\, \alpha_{ij} ,
$$

where $s_{ij} = s_{ji}$ and $0 \le \alpha_{ij} \le 1$. (Hint: Give the corresponding $\alpha_{ij}$ for the 
Metropolis-Hastings algorithm and show that it is maximal for $i \ne j$. Tierney 
1998, Mira and Geyer 1998, and Tierney and Mira 1998 propose extensions to 
the continuous case.) 

7.41 Show that / is the stationary density associated with the acceptance proba- 
bility (7.20). 

7.42 In the setting of Example 7.28, implement the simulated annealing algorithm 
to find the maximum of the likelihood h{0\xi^X2^x^). Compare with the per- 
formances based on \og\j{9\x\^X2^xz). 

7.43 (Winkler 1995) A Potts model is defined on a set S of “sites” and a finite set 
G of “colors” by its energy 



H {x^ — ^ ^ (^st^Xs=xt ^ X ^ G , 

(s,t) 



where ast = octs^ the corresponding distribution being 7t{x) oc exp(i7(x)). An 
additional structure is introduced as follows: “Bonds” b are associated with each 
pair (s,t) such that ast > 0. These bonds are either active (6 = 1) or inactive 
(6 = 0). 

(a) Defining the joint distribution 

/i(x, b) oc n n (1 - Qst)l Xs=Xf ■) 

bst—0 bst — ^ 

with Qst = exp{ast), show that the marginal of /x in x is tt. Show that the 
marginal of /i in 5 is 

m(6) oc n n (1 - 9^*)’ 

bst—O ^st — ^ 



(b) 



where c{b) denotes the number of clusters (the number of sites connected 
by active bonds). 

Show that the Swendsen-Wang (1987) algorithm 



1, Take — 0 if Xs ^ xt and, for Xs —Xtt 




with probability 1 — 
otherwise . 



2. For every cluster, choose a color at random on G. 



leads to simulations from tt {Note: This algorithm is acknowledged as ac- 
celerating convergence in image processing.) 
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7.44 (McCulloch 1997) In a generalized linear mixed model, assume the link func- 
tion is h{^i) = x'i/3 -h z[h^ and further assume that b == {bi^ . . . . hi) where 
b fh{h\D). (Here, we assume (p to be unknown.) 

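(a) Show that the usual (incomplete-data) likelihood is

$$L(\theta,\varphi,D|y) = \int \prod_{i=1}^n f(y_i|\theta_i)\, f_b(b|D)\, db.$$

(b) Denote the complete data by $w = (y, b)$, and show that

$$\log L_w = \sum_{i=1}^n \log f(y_i|\theta_i) + \log f_b(b_i|D).$$

(c) Show that the EM algorithm, given by

1. Choose starting values $\beta^{(0)}$, $\varphi^{(0)}$, and $D^{(0)}$.

2. Calculate (expectations evaluated under $\beta^{(m)}$, $\varphi^{(m)}$, and
$D^{(m)}$) $\beta^{(m+1)}$ and $\varphi^{(m+1)}$, which maximize
$E[\log f(y_i|\theta_i, u, \beta, \varphi)|y]$.

3. $D^{(m+1)}$ maximizes $E[\log f_b(b|D)|y]$.

4. Set $m$ to $m + 1$.

converges to the MLE.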

The next problems (7.45-7.49) deal with Langevin diffusions, as introduced in 
Section 7.8.5. 

7.45 Show that the naive discretization of (7.22) as $dt = \sigma^2$,
$dL_t = X^{(t+\sigma^2)} - X^{(t)}$, and $dB_t = B_{t+\sigma^2} - B_t$ does lead to
the representation (7.23).

7.46 Consider $f$ to be the density of $\mathcal{N}(0,1)$. Show that when
$\sigma^2 = 2$ in (7.23), the limiting distribution of the chain is
$\mathcal{N}(0,2)$.

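7.47 Show that (7.27) can be directly simulated as

$$\theta_2 \sim \mathcal{N}\left(\frac{y}{3}, \frac{5}{6}\right), \qquad \theta_1|\theta_2 \sim \mathcal{N}\left(\frac{2y - \theta_2}{5}, \frac{4}{5}\right).$$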

7.48 Show that when (7.24) exists and is larger (smaller) than 1 ($-1$) at
$-\infty$ ($+\infty$), the random walk (7.23) is transient.

7.49 (Stramer and Tweedie 1999a) Show that the following stochastic differential
equation still produces $f$ as the stationary distribution of the associated process:

$$dL_t = \sigma(L_t)\, dB_t + b(L_t)\, dt$$

when

$$b(x) = \frac{1}{2} \nabla \log f(x)\, \sigma^2(x) + \sigma(x) \nabla \sigma(x).$$

Give a discretized version of this differential equation to derive a Metropolis-
Hastings algorithm and apply to the case $\sigma(x) = \exp(\omega|x|)$.



7.8 Notes 



7.8.1 Background of the Metropolis Algorithm 

The original Metropolis algorithm was introduced by Metropolis et al. (1953) in a
setup of optimization on a discrete state-space, in connection with particle physics:
the paper was actually published in the Journal of Chemical Physics. All the authors
of this seminal paper were involved at a very central level in the Los Alamos
research laboratory during and after World War II. At the beginning of the war, desk
calculators (operated mostly by the wives of the scientists working in Los Alamos)
were used to evaluate the behavior of nuclear explosives, to be replaced by the end
of the war by one of the first computers at the instigation of von Neumann. Both
a physicist and a mathematician, Metropolis, who died in Los Alamos in 1999, came
to this place in April 1943 and can be considered (with Stanislaw Ulam) to be the
father of Monte Carlo methods. They not only coined the name “Monte Carlo
methods”, but also ran the first computer Monte Carlo experiment on the MANIAC
computer in 1948! Also a physicist, Marshall Rosenbluth joined the Los Alamos
Lab later, in 1950, where he worked on the development of the H bomb until 1953.
Edward Teller is the most controversial character of the group: as early as 1942, he
was one of the first scientists to work on the Manhattan Project that led to the
production of the A bomb. Almost as early, he became obsessed with the hydrogen (H)
bomb that he eventually managed to design with Stanislaw Ulam and better
computer facilities in the early 1950s. Teller’s wife, Augusta (Mici) Teller, emigrated
with him from Germany in the 1930s and got a Ph.D. from Pittsburgh University;
she was also part of the team of “computers” operating the desk calculators in Los
Alamos.

The Metropolis algorithm was later generalized by Hastings (1970) and Peskun
(1973, 1981) to statistical simulation. Despite several other papers that highlighted
its usefulness in specific settings (see, for example, Geman and Geman 1984, Tanner
and Wong 1987, Besag 1989), the starting point for an intensive use of Markov chain
Monte Carlo methods by the statistical community can be traced to the presentation
of the Gibbs sampler by Gelfand and Smith (1990), as explained in Casella
and George (1992) or Chib (1995).

The gap of more than 30 years between Metropolis et al. (1953) and Gelfand and 
Smith (1990) can be partially attributed to the lack of appropriate computing power, 
as most of the examples now processed by Markov chain Monte Carlo algorithms 
could not have been treated previously. 

As shown by Hastings (1970), Metropolis-Hastings algorithms are a special case
of a more general class of algorithms whose transition is associated with the
acceptance probability

$$\varrho(x,y) = \frac{s(x,y)}{1 + \dfrac{f(x)\,q(y|x)}{f(y)\,q(x|y)}}, \tag{7.20}$$

where $s$ is an arbitrary positive symmetric function such that
$\varrho(x,y) \le 1$ (see also Winkler 1995). The particular case $s(x,y) = 1$ is
also known as the Boltzmann algorithm and is used in simulation for particle
physics, although Peskun (1973) has shown that, in the discrete case, the
performance of this algorithm is always suboptimal compared to the
Metropolis-Hastings algorithm (see Problem 7.40).
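
To make the role of $s$ in (7.20) concrete, the following minimal Python sketch (not part of the original text) builds the transition matrix on a three-state space for two choices of $s$: the one that gives back the Metropolis-Hastings acceptance probability, and $s \equiv 1$ (the Boltzmann case). The target `f`, the proposal matrix `q`, and all function names are illustrative assumptions. The sketch checks numerically that $f$ is stationary in both cases and that the Metropolis-Hastings kernel moves at least as often as the Boltzmann one, in line with Peskun's ordering.

```python
def transition_matrix(f, q, s):
    """Transition matrix of the chain that proposes y with probability
    q[x][y] = q(y|x) and accepts with probability s(t) / (1 + t), where
    t = f(x) q(y|x) / (f(y) q(x|y)), as in (7.20)."""
    n = len(f)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y:
                t = f[x] * q[x][y] / (f[y] * q[y][x])
                P[x][y] = q[x][y] * s(t) / (1.0 + t)
        P[x][x] = 1.0 - sum(P[x])  # rejected proposals leave the chain at x
    return P


def stationary_gap(f, P):
    """Largest absolute difference between the vectors f P and f."""
    n = len(f)
    return max(abs(sum(f[x] * P[x][y] for x in range(n)) - f[y])
               for y in range(n))


# Hypothetical three-state target and strictly positive proposal matrix.
f = [0.2, 0.3, 0.5]
q = [[0.2, 0.5, 0.3],
     [0.4, 0.2, 0.4],
     [0.6, 0.3, 0.1]]

# s written as a function of t; both choices are symmetric in (x, y)
# because exchanging x and y turns t into 1/t.
def s_metropolis(t):
    return 1.0 + min(t, 1.0 / t)   # yields rho = min(1, 1/t)

def s_boltzmann(t):
    return 1.0                     # yields rho = 1 / (1 + t)

P_mh = transition_matrix(f, q, s_metropolis)
P_bz = transition_matrix(f, q, s_boltzmann)

print(stationary_gap(f, P_mh), stationary_gap(f, P_bz))
```

Writing $s$ as a function of the ratio $t$ is only a convenience of this sketch; any positive symmetric $s$ keeping $\varrho \le 1$ would do.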

(MANIAC stands for Mathematical Analyzer, Numerical Integrator and Computer.
Metropolis himself attributes the original idea of the Monte Carlo method to
Enrico Fermi, 15 years before the 1948 experiment. To end on a somber note,
Edward Teller later testified against Oppenheimer in the McCarthy trials and,
much later, was a fervent proponent of the “Star Wars” defense system under the
Reagan administration.)

For more details on the historical developments, see Hitchcock (2003).

7.8.2 Geometric Convergence of Metropolis-Hastings Algorithms 

The sufficient condition (7.6) of Lemma 7.6 for the irreducibility of the
Metropolis-Hastings Markov chain is particularly well adapted to random walks,
with transition densities $q(y|x) = g(y - x)$. It is, indeed, enough that $g$ is
positive in a neighborhood of 0 to ensure the ergodicity of [A.24] (see Section 7.5
for a detailed study of these methods). On the other hand, convergence results
stronger than the simple ergodic convergence of (7.1) or than the (total variation)
convergence of $\|P^n_{x^{(0)}} - f\|_{TV}$ are difficult to derive without
introducing additional conditions on $f$ and $q$. For instance, it is impossible to
establish geometric convergence without a restriction to the discrete case or
without considering particular transition densities, since Roberts and Tweedie
(1996) have come up with chains which are not geometrically ergodic. Defining,
for every measure $\nu$ and every $\nu$-measurable function $h$, the essential
supremum

$$\text{ess}_\nu \sup h(x) = \inf\{w;\ \nu(h(x) > w) = 0\},$$

they established the following result, where $\bar\rho$ stands for the average
acceptance probability.

Theorem 7.27. If the marginal probability of acceptance satisfies

$$\text{ess}_f \sup\, (1 - \bar\rho(x)) = 1,$$

the algorithm [A.24] is not geometrically ergodic.

Therefore, if $\bar\rho$ is not bounded from below on a set of measure 1, a
geometric speed of convergence cannot be guaranteed for [A.24]. This result is
important, as it characterizes Metropolis-Hastings algorithms that are weakly
convergent (see the extreme case of Example 12.10); however, it cannot be used to
establish nongeometric ergodicity, since the function $\bar\rho$ is almost always
intractable.

When the state space $\mathcal{X}$ is a small set (see Chapter 6), Roberts and
Polson (1994) note that the chain is uniformly ergodic. However, this is rarely
the case when the state-space is uncountable. It is, in fact, equivalent to
Doeblin’s condition, as stated by Theorem 6.59. Chapter 9 exhibits examples of
continuous Gibbs samplers for which uniform ergodicity holds (see Examples 10.4
and 10.17), but Section 7.5 has shown that in the particular case of random walks,
uniform ergodicity almost never holds, even though this type of move is a natural
choice for the instrumental distribution.

7.8.3 A Reinterpretation of Simulated Annealing 

Consider a function $E$ defined on a finite set $\mathcal{E}$ with such a large
cardinality that a minimization of $E$ based on the comparison of the values of
$E(\mathcal{E})$ is not feasible. The simulated annealing technique (see Section
5.2.3) is based on a conditional density $q$ on $\mathcal{E}$ such that
$q(i|j) = q(j|i)$ for every $(i,j) \in \mathcal{E}^2$. For a given value $T > 0$, it
produces a Markov chain $(X^{(t)})$ on $\mathcal{E}$ by the following transition:

1. Generate $\zeta_t$ according to $q(\zeta|x^{(t)})$.

2. Take

$$X^{(t+1)} = \begin{cases} \zeta_t & \text{with probability } \exp\left(\{E(X^{(t)}) - E(\zeta_t)\}/T\right) \wedge 1, \\ X^{(t)} & \text{otherwise.} \end{cases}$$

As noted in Section 5.2.3, the simulated value $\zeta_t$ is automatically accepted
when $E(\zeta_t) \le E(X^{(t)})$. The fact that the simulated annealing algorithm
may accept a value $\zeta_t$ with $E(\zeta_t)$ larger than $E(X^{(t)})$ is a very
positive feature of the method, since it allows for escapes from the attraction
zone of local minima of $E$ when $T$ is large enough. The simulated annealing
algorithm is actually a Metropolis-Hastings algorithm with stationary distribution
$f(x) \propto \exp(-E(x)/T)$, provided that the matrix of the $q(i|j)$'s generates
an irreducible chain. Note, however, that the theory of time-homogeneous Markov
chains presented in Chapter 6 does not cover the extension to the case when $T$
varies with $t$ and converges to 0 “slowly enough” (typically in $1/\log t$).
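
The transition above translates almost line by line into code. The following Python sketch (not from the original text) runs simulated annealing on the finite set {0, ..., n-1} arranged on a cycle, with a symmetric nearest-neighbor proposal and a temperature decreasing in $1/\log t$. The energy function, the cooling constant, and all names are illustrative assumptions.

```python
import math
import random


def simulated_annealing(E, n_states, n_iter, temp, x0=0, seed=0):
    """Minimize E over {0, ..., n_states - 1}; temp(t) gives the temperature
    at iteration t. Returns the final state and the best state visited."""
    rng = random.Random(seed)
    x = x0
    best = x
    for t in range(1, n_iter + 1):
        T = temp(t)
        # Symmetric proposal, q(i|j) = q(j|i): one step left or right on a cycle.
        zeta = (x + rng.choice((-1, 1))) % n_states
        # Accept with probability exp({E(x) - E(zeta)} / T) ∧ 1.
        delta = E(x) - E(zeta)
        if delta >= 0 or rng.random() < math.exp(delta / T):
            x = zeta
        if E(x) < E(best):
            best = x
    return x, best


# Hypothetical energy with several local minima on 100 states.
def energy(i):
    return 0.01 * (i - 70) ** 2 + 3.0 * math.cos(i / 4.0)


final_state, best_state = simulated_annealing(
    energy, n_states=100, n_iter=20000, temp=lambda t: 2.0 / math.log(t + 1)
)
print(final_state, best_state, energy(best_state))
```

Keeping track of the best state visited is a practical addition of this sketch and is not part of the transition described above.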

7.8.4 Reference Acceptance Rates 

Roberts et al. (1997) recommend the use of instrumental distributions with
acceptance rate close to 1/4 for models of high dimension and equal to 1/2 for the
models of dimension 1 or 2. This heuristic rule is based on the asymptotic behavior
of an efficiency criterion equal to the ratio of the variance of an estimator based
on an iid sample and the variance of the estimator (3.1); that is,

$$\left[1 + 2 \sum_{k>0} \text{corr}\left(X^{(0)}, X^{(k)}\right)\right]^{-1}$$

in the case $h(x) = x$. When $f$ is the density of the $\mathcal{N}(0,1)$
distribution and $g$ is the density of a Gaussian random walk with variance
$\sigma^2$, Roberts et al. (1997) have shown that the optimal choice of $\sigma$ is
2.4, with an asymmetry in the efficiency in favor of large values of $\sigma$. The
corresponding acceptance rate is

$$\rho = \frac{2}{\pi} \arctan\left(\frac{2}{\sigma}\right),$$

equal to 0.44 for $\sigma = 2.4$. A second result by Roberts et al. (1997), based
on an approximation of $X^{(t)}$ by a Langevin diffusion process (see Section
7.8.5) when the dimension of the problem goes to infinity, is that the acceptance
probability converges to 0.234 (approximately 1/4). An equivalent version of this
empirical rule is to take the scale factor in $g$ equal to $2.38/\sqrt{d}\ \Sigma$,
where $d$ is the dimension of the model and $\Sigma$ is the asymptotic variance of
$X^{(t)}$. This is obviously far from being an absolute optimality result since
this choice is based on the particular case of the normal distribution, which is
not representative (to say the least) of the distributions usually involved in the
Markov chain Monte Carlo algorithms. In addition, $\Sigma$ is never known in
practice.
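
As a quick numerical check of the acceptance rate formula, the Python sketch below (not from the original text) evaluates $\rho = (2/\pi)\arctan(2/\sigma)$ at $\sigma = 2.4$ and compares it with the empirical acceptance rate of a Gaussian random walk Metropolis-Hastings chain targeting $\mathcal{N}(0,1)$. The function names, the chain length, and the seed are illustrative assumptions.

```python
import math
import random


def acceptance_rate_formula(sigma):
    """Acceptance rate (2 / pi) * arctan(2 / sigma) quoted in the text."""
    return 2.0 / math.pi * math.atan(2.0 / sigma)


def empirical_acceptance(sigma, n_iter=200000, seed=1):
    """Acceptance frequency of a random walk Metropolis-Hastings chain with
    N(0, sigma^2) increments and target N(0, 1), started at stationarity."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    accepted = 0
    for _ in range(n_iter):
        y = x + sigma * rng.gauss(0.0, 1.0)
        # log f(y) - log f(x) for the standard normal target
        if math.log(1.0 - rng.random()) < 0.5 * (x * x - y * y):
            x = y
            accepted += 1
    return accepted / n_iter


rho_theory = acceptance_rate_formula(2.4)
rho_hat = empirical_acceptance(2.4)
print(round(rho_theory, 4), round(rho_hat, 4))
```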

The implementation of this heuristic rule follows the principle of an algorithmic
calibration, first proposed by Müller (1991); that is, of successive modifications of
scale factors followed by estimations of $\Sigma$ and of the acceptance rate until
this rate is close to 1/4 and the estimation of $\Sigma$ remains stable. Note, again,
that the convergence results of Metropolis-Hastings algorithms only apply for these
adaptive versions when the different hyperparameters of the instrumental
distribution are fixed: As long as these parameters are modified according to the
simulation results, the resulting Markov chain is heterogeneous (or is not a Markov
chain anymore if the parameters depend on the whole history of the chain). This
cautionary notice signifies that, in practice, the use of a (true)
Metropolis-Hastings algorithm must be preceded by a calibration step which
determines an acceptable range for the simulation hyperparameters.

ρ      0.991   …                                                       0.891
t_a    106.9   67.1    32.6    20.4    9.37    9.08    9.15    9.54
h_1    41.41   44.24   44.63   43.76   42.59   42.12   42.92   42.94
h_2    0.035   0.038   0.035   0.036   0.036   0.036   0.035   0.036
h_3    0.230   0.228   0.227   0.226   0.228   0.229   0.228   0.230

Table 7.7. Performances of the algorithm [A.29] associated with (7.21) for
$x_1 = -8$, $x_2 = 8$, and $x_3 = 17$ and the random walk $\mathcal{C}(0, \sigma^2)$.
These performances are evaluated via the acceptance probability $\rho$, the
interjump time, $t_a$, and the empirical variance associated with the approximation
of the expectations $E^\pi[h_i(\theta)]$ (20,000 simulations).

Example 7.28. Cauchy posterior distribution. Consider $X_1$, $X_2$, and $X_3$ iid
$\mathcal{C}(\theta, 1)$ and $\pi(\theta) = \exp(-\theta^2/100)$. The posterior
distribution on $\theta$, $\pi(\theta|x_1, x_2, x_3)$, is proportional to

$$e^{-\theta^2/100}\left[(1 + (\theta - x_1)^2)(1 + (\theta - x_2)^2)(1 + (\theta - x_3)^2)\right]^{-1}, \tag{7.21}$$

which is trimodal when $x_1$, $x_2$, and $x_3$ are sufficiently spaced out, as
suggested by Figure 1.1 for $x_1 = 0$, $x_2 = 5$, and $x_3 = 9$. This distribution
is therefore adequate to test the performances of the Metropolis-Hastings algorithm
in a unidimensional setting. Given the dispersion of the distribution (7.21), we
use a random walk based on a Cauchy distribution $\mathcal{C}(0, \sigma^2)$.
(Chapter 9 proposes an alternative approach via the Gibbs sampler for the
simulation of (7.21).)

Besides the probability of acceptance, a parameter of interest for the comparison
of algorithms is the interjump time, $t_a$; that is, the average number of
iterations that it takes the chain to move to a different mode. (For the above
values, the three regions considered are $(-\infty, 0]$, $(0, 13]$, and
$(13, +\infty)$.) Table 7.7 provides, in addition, an evaluation of [A.29], through
the values of the standard deviation of the random walk $\sigma$, in terms of the
variances of some estimators of $E^\pi[h_i(\theta)]$ for the functions

$$h_1(\theta) = \theta, \qquad h_2(\theta) = [\text{expression lost}], \qquad \text{and} \qquad h_3(\theta) = \mathbb{I}_{[4,8]}(\theta).$$

The means of these estimators for the different values of $\sigma$ are quite similar
and equal to 8.96, 0.063, and 0.35, respectively. A noteworthy feature of this
example is that the probability of acceptance never goes under 0.88 and, therefore,
the goal of Roberts et al. (1997) cannot be attained with this choice of
instrumental distribution, whatever $\sigma$ is. This phenomenon is quite common in
practice. ||



7.8.5 Langevin Algorithms

Alternatives to the random walk Metropolis-Hastings algorithm can be derived from
diffusion theory as proposed in Grenander and Miller (1994) and Phillips and Smith
(1996). The basic idea of this approach is to seek a diffusion equation (or a
stochastic differential equation) which produces a diffusion (or continuous-time
process) with stationary distribution $f$ and then discretize the process to
implement the method. More specifically, the Langevin diffusion $L_t$ is defined by
the stochastic differential equation

$$dL_t = dB_t + \frac{1}{2} \nabla \log f(L_t)\, dt, \tag{7.22}$$

where $B_t$ is the standard Brownian motion; that is, a random function such that
$B_0 = 0$, $B_t \sim \mathcal{N}(0, \omega^2 t)$,
$B_t - B_{t'} \sim \mathcal{N}(0, \omega^2 |t - t'|)$ and $B_t - B_{t'}$ is
independent of $B_{t'}$ ($t > t'$). (This process is the limit of a simple random
walk when the magnitude of the steps, $\Delta$, and the time between steps, $\tau$,
both approach zero in such a way that

$$\lim \Delta/\sqrt{\tau} = \omega.$$

See, for example, Ethier and Kurtz 1986, Resnick 1994, or Norris 1997.) As stressed
by Roberts and Rosenthal (1998), the Langevin diffusion (7.22) is the only
non-explosive diffusion which is reversible with respect to $f$.

The actual implementation of the diffusion algorithm involves a discretization
step where (7.22) is replaced with the random walk like transition

$$x^{(t+1)} = x^{(t)} + \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) + \sigma \varepsilon_t, \tag{7.23}$$

where $\varepsilon_t \sim \mathcal{N}_p(0, I_p)$ and $\sigma^2$ corresponds to the
discretization size (see Problem 7.45). Although this naive discretization aims at
reproducing the convergence of the random walk to the Brownian motion, the behavior
of the Markov chain (7.23) may be very different from that of the diffusion process
(7.22). As shown by Roberts and Tweedie (1995), the chain (7.23) may well be
transient! Indeed, a sufficient condition for this transience is that the limits

$$\lim_{x \to \pm\infty} \sigma^2\, \nabla \log f(x)\, |x|^{-1} \tag{7.24}$$

exist and are larger than 1 and smaller than $-1$ at $-\infty$ and $+\infty$,
respectively, since the moves are then necessarily one-sided for large values of
$|x^{(t)}|$. Note also the strong similarity between (7.23) and the stochastic
gradient equation inspired by (5.4),

$$x^{(t+1)} = x^{(t)} + \frac{\sigma}{2} \nabla \log f(x^{(t)}) + \sigma \varepsilon_t. \tag{7.25}$$

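As suggested by Besag (1994), a way to correct this negative behavior is to treat
(7.23) as a regular Metropolis-Hastings instrumental distribution; that is, to accept
the new value $Y_t$ with probability

$$\frac{f(Y_t)}{f(x^{(t)})} \cdot \frac{\exp\left\{-\left\|x^{(t)} - Y_t - \frac{\sigma^2}{2} \nabla \log f(Y_t)\right\|^2 \big/ 2\sigma^2\right\}}{\exp\left\{-\left\|Y_t - x^{(t)} - \frac{\sigma^2}{2} \nabla \log f(x^{(t)})\right\|^2 \big/ 2\sigma^2\right\}} \wedge 1.$$

Here the numerator carries the density of the reverse move from $Y_t$ to $x^{(t)}$
and the denominator the density of the proposed move from $x^{(t)}$ to $Y_t$, as in
any Metropolis-Hastings ratio. The corresponding Metropolis-Hastings algorithm will
not necessarily outperform the regular random walk Metropolis-Hastings algorithm,
since Roberts and Tweedie (1995) show that the resulting chain is not geometrically
ergodic when $\nabla \log f(x)$ goes to 0 at infinity, similar to the random walk,
but the (basic) ergodicity of this chain is ensured.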
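
The Metropolis-Hastings correction of the discretized Langevin move can be sketched in a few lines. The Python code below (not from the original text) is a one-dimensional illustration: it proposes according to (7.23) and accepts with the ratio of target values times the ratio of reverse to forward proposal densities. The standard normal target, the value of $\sigma$, and all names are illustrative assumptions.

```python
import math
import random


def langevin_mh(log_f, grad_log_f, x0, sigma, n_iter, seed=0):
    """One-dimensional Langevin Metropolis-Hastings sampler.
    Returns the chain and the observed acceptance rate."""
    rng = random.Random(seed)

    def log_q(to, frm):
        # Log density, up to a constant, of the move frm -> to under (7.23).
        mean = frm + 0.5 * sigma ** 2 * grad_log_f(frm)
        return -((to - mean) ** 2) / (2.0 * sigma ** 2)

    x, chain, accepted = x0, [], 0
    for _ in range(n_iter):
        y = x + 0.5 * sigma ** 2 * grad_log_f(x) + sigma * rng.gauss(0.0, 1.0)
        # log of f(y) q(x|y) / (f(x) q(y|x))
        log_ratio = log_f(y) - log_f(x) + log_q(x, y) - log_q(y, x)
        if math.log(1.0 - rng.random()) < log_ratio:
            x, accepted = y, accepted + 1
        chain.append(x)
    return chain, accepted / n_iter


# Assumed target: N(0, 1), so log f(x) = -x^2/2 and its gradient is -x.
chain, rate = langevin_mh(lambda x: -0.5 * x * x, lambda x: -x,
                          x0=3.0, sigma=1.0, n_iter=20000)
mean = sum(chain) / len(chain)
var = sum((v - mean) ** 2 for v in chain) / len(chain)
print(rate, mean, var)
```

Only `log_f` up to an additive constant and its gradient are needed, which matches the remark that normalizing constants are not required.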

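Roberts and Rosenthal (1998) give further results about the choice of the scaling
factor $\sigma$, which should lead to an acceptance rate of 0.574 to achieve optimal
convergence rates in the special case where the components of $x$ are uncorrelated
under $f$. Note also that the proposal distribution (7.23) is rather natural from a
Laplace approximation point of view since it corresponds to a second-order
approximation of $f$. Indeed, by a standard Taylor expansion,

$$\log f(x^{(t+1)}) = \log f(x^{(t)}) + (x^{(t+1)} - x^{(t)})' \nabla \log f(x^{(t)}) + \frac{1}{2} (x^{(t+1)} - x^{(t)})' \left[\nabla' \nabla \log f(x^{(t)})\right] (x^{(t+1)} - x^{(t)}),$$

the random walk type approximation to $f(x^{(t+1)})$ is

$$\begin{aligned}
f(x^{(t+1)}) &\propto \exp\left\{(x^{(t+1)} - x^{(t)})' \nabla \log f(x^{(t)}) - \frac{1}{2} (x^{(t+1)} - x^{(t)})' H(x^{(t)}) (x^{(t+1)} - x^{(t)})\right\} \\
&\propto \exp\left\{-\frac{1}{2} (x^{(t+1)})' H(x^{(t)}) x^{(t+1)} + (x^{(t+1)})' \left[\nabla \log f(x^{(t)}) + H(x^{(t)}) x^{(t)}\right]\right\} \\
&\propto \exp\left\{-\frac{1}{2} \left[x^{(t+1)} - x^{(t)} - (H(x^{(t)}))^{-1} \nabla \log f(x^{(t)})\right]' H(x^{(t)}) \left[x^{(t+1)} - x^{(t)} - (H(x^{(t)}))^{-1} \nabla \log f(x^{(t)})\right]\right\},
\end{aligned}$$

where $H(x^{(t)}) = -\nabla' \nabla \log f(x^{(t)})$ is the Hessian matrix. If we
simplify this approximation by replacing $H(x^{(t)})$ with $\sigma^{-2} I_p$, the
Taylor approximation then leads to the random walk with a drift term

$$x^{(t+1)} = x^{(t)} + \sigma^2 \nabla \log f(x^{(t)}) + \sigma \varepsilon_t. \tag{7.26}$$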

From an exploratory point of view, the addition of the gradient of $\log f$ is
relevant, since it should improve the moves toward the modes of $f$, while only
requiring a minimum knowledge of this density function (in particular, constants
are not needed). Note also that, in difficult settings, exact gradients can be
replaced by numerical derivatives.

(Roberts and Rosenthal (1998) also show that the optimal scaling of the Langevin
algorithm corresponds to a variance of order $p^{-1/3}$, whereas the optimal
variance for the random walk Metropolis-Hastings algorithm is of order $p^{-1}$;
see Roberts et al. 1997.)

Stramer and Tweedie (1999b) start from the lack of uniform minimal performances
of (7.23) to build up modifications (see Problem 7.49) which avoid some
pathologies of the basic Langevin Metropolis-Hastings algorithm. They obtain
general geometric and uniformly ergodic convergence results.



Example 7.29. Nonidentifiable normal model. To illustrate the performance
of the Langevin diffusion method in a more complex setting, consider the
nonidentifiable model

$$Y \sim \mathcal{N}\left(\frac{\theta_1 + \theta_2}{2}, 1\right)$$

associated with the exchangeable prior $\theta_1, \theta_2 \sim \mathcal{N}(0, 1)$.
The posterior distribution

$$\pi(\theta_1, \theta_2|y) \propto \exp\left(-\frac{1}{2}\left\{\theta_1^2\, \frac{5}{4} + \theta_2^2\, \frac{5}{4} + \frac{\theta_1 \theta_2}{2} - (\theta_1 + \theta_2)\, y\right\}\right) \tag{7.27}$$

is then well defined (and can be simulated directly, see Problem 7.47), with a ridge
structure due to the nonidentifiability. The Langevin transition is then based on the
proposal

$$\mathcal{N}_2\left(\frac{\sigma^2}{2}\left(\frac{2y - 5\theta_1^{(t)} - \theta_2^{(t)}}{4},\ \frac{2y - \theta_1^{(t)} - 5\theta_2^{(t)}}{4}\right) + \left(\theta_1^{(t)}, \theta_2^{(t)}\right),\ \sigma^2 I_2\right),$$

where the drift is $\sigma^2/2$ times the gradient of the logarithm of (7.27), in
agreement with (7.23). A preliminary calibration then leads to $\sigma = 1.46$ for
an acceptance rate of 1/2. As seen in Figure 7.9, the 90% range is larger for the
output of the Langevin algorithm than for an iid sample in the approximation of
$E[\theta_1|y]$, but the ratio of these ranges is only approximately 1.3, which
shows a moderate loss of efficiency in using the Langevin approximation. ||

Fig. 7.9. Convergence of the empirical averages for the Langevin Metropolis-
Hastings (full line) and iid simulation (dashes) of $(\theta_1, \theta_2)$ for the
estimation of $E[\theta_1]$ and corresponding 90% equal tail confidence intervals
when $y = 4.3$. The final intervals are [1.373, 1.499] and [1.388, 1.486] for the
Langevin and exact algorithms, respectively.



The Slice Sampler



He’d heard stories that shredded documents could be reconstructed. All it 
took was patience: colossal patience. 

— Ian Rankin, Let it Bleed 



While many of the MCMC algorithms presented in the previous chapter are
both generic and universal, there exists a special class of MCMC algorithms
that are more model dependent in that they exploit the local conditional
features of the distributions to simulate. Before starting the general description
of such algorithms, gathered under the (somewhat inappropriate) name of
Gibbs sampling, we provide in this chapter a simpler introduction to this
special kind of MCMC algorithm. We reconsider the fundamental theorem
of simulation (Theorem 2.15) in light of the possibilities opened by MCMC
methodology and construct the corresponding slice sampler.

The previous chapter developed simulation techniques that could be called
“generic,” since they require only a limited amount of information about the
distribution to be simulated. For example, the generic algorithm ARMS (Note
7.4.2) aims at reproducing the density $f$ of this distribution in an automatic
manner where only the numerical value of $f$ at given points matters. However,
Metropolis-Hastings algorithms can achieve higher levels of efficiency when
they take into account the specifics of the target distribution $f$, in
particular through the calibration of the acceptance rate (see Section 7.6.1).
Moving even further in this direction, the properties and performance of the
method presented in this chapter are closely tied to the distribution $f$.



8.1 Another Look at the Fundamental Theorem 

Recall from Section 2.3.1 that the generation from a distribution with density
$f(x)$ is equivalent to uniform generation on the subgraph of $f$,

$$\mathscr{S}(f) = \{(x, u);\ 0 \le u \le f(x)\},$$

whatever the dimension of $x$, and $f$ need only be known up to a normalizing
constant.

From the development at the beginning of Chapter 7, we can consider
the possibility of using a Markov chain with stationary distribution equal to
this uniform distribution on $\mathscr{S}(f)$ as an approximate way to simulate
from $f$. A natural solution is to use a random walk on $\mathscr{S}(f)$, since a
random walk on a set $\mathcal{A}$ usually results in a stationary distribution
that is the uniform distribution on $\mathcal{A}$ (see Examples 6.39, 6.40, and
6.73).

There are many ways of implementing a random walk on this set, but a
natural solution is to go one direction at a time, that is, to move iteratively
along the $u$-axis and then along the $x$-axis. Furthermore, we can use uniform
moves in both directions, since, as formally shown below, the associated
Markov chain on $\mathscr{S}(f)$ does not require a Metropolis-Hastings correction
to have the uniform distribution on $\mathscr{S}(f)$ as stationary distribution.
Starting from a point $(x, u)$ in $\{(x, u) : 0 < u < f(x)\}$, the move along the
$u$-axis will correspond to the conditional distribution

$$U|X = x \sim \mathcal{U}(\{u : u \le f(x)\}), \tag{8.1}$$

resulting in a change from point $(x, u)$ to point $(x, u')$, still in
$\mathscr{S}(f)$, and then the move along the $x$-axis to the conditional
distribution

$$X|U = u' \sim \mathcal{U}(\{x : u' \le f(x)\}), \tag{8.2}$$

resulting in a change from point $(x, u')$ to point $(x', u')$.

This set of proposals is the basis chosen for the original slice sampler of 
Neal (1997) (published as Neal 2003) and Damien et al. (1999), which thus 
uses a 2-step uniform random walk over the subgraph. We inaccurately call it 
the 2D slice sampler to distinguish it from the general slice sampler defined 
in Section 8.2, even though the dimension of x is arbitrary. 



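Algorithm A.31 -2D slice sampler-

At iteration $t$, simulate

1. $u^{(t+1)} \sim \mathcal{U}_{[0, f(x^{(t)})]}$;

2. $x^{(t+1)} \sim \mathcal{U}_{A^{(t+1)}}$, with
   $A^{(t+1)} = \{x : f(x) \ge u^{(t+1)}\}$.                                [A.31]
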

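From (8.1) it is also clear that $x^{(t)}$ is always part of the set $A^{(t+1)}$,
which is thus nonempty. Moreover, the algorithm remains valid if
$f(x) = C f_1(x)$, and we use $f_1$ instead of $f$ (see Problem 8.1). This is quite
advantageous in settings where $f$ is an unnormalized density like a posterior
density.

As we will see in more generality in Chapter 10, the validity of [A.31]
as an MCMC algorithm associated with $f_1$ stems from the fact that both
steps 1. and 2. in [A.31] successively preserve the uniform distribution on
the subgraph of $f$: first, if $x^{(t)} \sim f(x)$ and
$u^{(t+1)} \sim \mathcal{U}_{[0, f_1(x^{(t)})]}$, then

$$(x^{(t)}, u^{(t+1)}) \sim f(x)\, \frac{\mathbb{I}_{[0, f_1(x)]}(u)}{f_1(x)} \propto \mathbb{I}_{0 \le u \le f_1(x)}.$$

Second, if $x^{(t+1)} \sim \mathcal{U}_{A^{(t+1)}}$, then

$$(x^{(t)}, u^{(t+1)}, x^{(t+1)}) \sim f(x^{(t)})\, \frac{\mathbb{I}_{[0, f_1(x^{(t)})]}(u^{(t+1)})}{f_1(x^{(t)})}\, \frac{\mathbb{I}_{A^{(t+1)}}(x^{(t+1)})}{\text{mes}(A^{(t+1)})},$$

where $\text{mes}(A^{(t+1)})$ denotes the (generally Lebesgue) measure of the set
$A^{(t+1)}$. Thus

$$\begin{aligned}
(x^{(t+1)}, u^{(t+1)}) &\sim C \int \mathbb{I}_{0 \le u \le f_1(x)}\, \frac{\mathbb{I}_{f_1(x^{(t+1)}) \ge u}}{\text{mes}(A^{(t+1)})}\, dx \\
&= C\, \mathbb{I}_{0 \le u \le f_1(x^{(t+1)})} \int \frac{\mathbb{I}_{u \le f_1(x)}}{\text{mes}(A^{(t+1)})}\, dx \\
&\propto \mathbb{I}_{0 \le u \le f_1(x^{(t+1)})},
\end{aligned}$$

and the uniform distribution on $\mathscr{S}(f)$ is indeed stationary for both steps.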

Example 8.1. Simple slice sampler. Consider the density
$f(x) = \frac{1}{2} e^{-\sqrt{x}}$ for $x > 0$. While it can be directly simulated
from (Problem 8.2), it also yields easily to the slice sampler. Indeed, applying
(8.1) and (8.2), we have

$$U|x \sim \mathcal{U}\left(0, \tfrac{1}{2} e^{-\sqrt{x}}\right), \qquad X|u \sim \mathcal{U}\left(0, [\log(2u)]^2\right).$$

We implement the sampler to generate 50,000 variates, and plot them along
with the density in Figure 8.1, which shows that the agreement is very good.
The performances of the slice sampler may however deteriorate when $\sqrt{x}$
is replaced with $x^{1/d}$ for $d$ large enough, as shown in Roberts and Tweedie
(2004). ||
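
The two conditional uniform draws of this example fit in a short loop. The Python sketch below (not from the original text) implements them as written; the starting value, the seed, and the function name are illustrative assumptions. Since $\sqrt{X}$ has density $y e^{-y}$ under $f$, its mean is 2, which gives a simple check on the output.

```python
import math
import random


def slice_sampler_sqrt_exp(n_iter, x0=1.0, seed=0):
    """2D slice sampler for f(x) = exp(-sqrt(x)) / 2 on x > 0."""
    rng = random.Random(seed)
    x = x0
    sample = []
    for _ in range(n_iter):
        # U | x ~ U(0, f(x)); 1 - random() lies in (0, 1], so u > 0.
        u = (1.0 - rng.random()) * 0.5 * math.exp(-math.sqrt(x))
        # X | u ~ U(0, [log(2u)]^2), the set where f(x) >= u.
        x = rng.random() * math.log(2.0 * u) ** 2
        sample.append(x)
    return sample


sample = slice_sampler_sqrt_exp(50000)
mean_sqrt = sum(math.sqrt(v) for v in sample) / len(sample)
print(mean_sqrt)
```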



Fig. 8.1. The slice sampler histogram and density for Example 8.1.

Example 8.2. Truncated normal distribution. The distribution represented
in Figure 8.2 is a truncated normal distribution $\mathcal{N}(-3, 1)$, restricted
to the interval [0, 1],

$$f(x) \propto f_1(x) = \exp\{-(x + 3)^2/2\}\, \mathbb{I}_{[0,1]}(x).$$

As mentioned previously, the naive simulation of a normal $\mathcal{N}(-3, 1)$
random variable until the outcome is in [0, 1] is suboptimal because there is only
about a 0.13% probability of this happening. However, devising and optimizing the
Accept-Reject algorithm of Example 2.20 can be costly if the algorithm is to
be used only a few times. The slice sampler [A.31] applied to this problem is
then associated with the horizontal slice

$$A^{(t+1)} = \left\{y;\ \exp\{-(y + 3)^2/2\} \ge u\, f_1(x^{(t)})\right\}.$$

If we denote by $\omega^{(t)}$ the value $u f_1(x^{(t)})$, the slice is also given by

$$A^{(t+1)} = \left\{y \in [0, 1];\ (y + 3)^2 \le -2 \log(\omega^{(t)})\right\},$$

which is an interval of the form $[0, \gamma^{(t)}]$. Figure 8.2 shows the first ten
steps of [A.31] started from $x^{(0)} = 0.25$. ||
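
A minimal Python sketch of this sampler follows (not from the original text). It uses $f_1(x) = \exp\{-(x+3)^2/2\}$ on [0, 1] and the explicit form of the slice as an interval $[0, \gamma]$; the function name, the seed, and the chain length are illustrative assumptions.

```python
import math
import random


def slice_truncated_normal(n_iter, x0=0.25, seed=0):
    """2D slice sampler for f1(x) = exp(-(x + 3)^2 / 2) restricted to [0, 1]."""
    rng = random.Random(seed)
    x = x0
    sample = []
    for _ in range(n_iter):
        # omega = u * f1(x) with u uniform on (0, 1]
        omega = (1.0 - rng.random()) * math.exp(-(x + 3.0) ** 2 / 2.0)
        # Slice {y in [0, 1] : (y + 3)^2 <= -2 log(omega)} = [0, gamma]
        gamma = math.sqrt(-2.0 * math.log(omega)) - 3.0
        gamma = max(0.0, min(1.0, gamma))  # guard against rounding error
        x = rng.random() * gamma
        sample.append(x)
    return sample


sample = slice_truncated_normal(20000)
sample_mean = sum(sample) / len(sample)
print(sample_mean)
```

Because the slice is always an interval starting at 0, no search for its endpoints is needed here, which is what makes this example so convenient.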

This algorithm will work well only if the exploration of the subgraph of
$f_1$ by the corresponding random walk is fast enough; if we take the above
example of the truncated normal, this is the case. Whatever the value of
$x^{(t)}$, the next value $x^{(t+1)}$ can be anything in [0, 1]. This property
actually holds for all $f$'s, given that, when $u^{(t+1)}$ is close to 0, the set
$A^{(t+1)}$ is close to the entire support of $f$ and, formally, every value
$x^{(t+1)}$ in the support of $f$ can be simulated in one step. (We detail below
some more advanced results about convergence properties of the slice sampler.)

Figure 8.3 illustrates the very limited dependence on the starting value for
the truncated normal example: the three graphs represent a sample of values
of $(x^{(t)}, u^{(t)})$ when $(x^{(0)}, u^{(0)})$ is in the upper left hand corner,
the lower right hand corner and the middle of the subgraph, respectively. To study
the effect of the starting value, we always used the same sequence of uniforms for
the three starting points. As can be seen from the plot, the samples are uniformly
spread out over the subgraph of $f$ and, more importantly, almost identical!
In this simple example, ten iterations are thus enough to cancel the effect of
the starting value.

Fig. 8.3. Comparison of three samples obtained by ten iterations of the slice
sampler starting from (.01, .01) (left), (.99, .001) (center) and (.25, .025) (right)
and based on the same pool of uniform variates.

The major (practical) difficulty with the slice sampler is with the simulation
of the uniform distribution $\mathcal{U}_{A^{(t+1)}}$, since the determination of
the set of $y$'s such that $f_1(y) \ge \omega$ can be intractable if $f_1$ is
complex enough. We thus consider an extension to [A.31] in the next section, but
call attention to an extension by Neal (2003) covered in Note 8.5.1.



8.2 The General Slice Sampler

As indicated above, sampling uniformly from the slice
$A^{(t)} = \{x;\ f_1(x) \ge \omega^{(t)}\}$ may be completely intractable, even
with the extension of Note 8.5.1. This difficulty persists as the dimension of $x$
gets larger. However, there exists a generalization of the 2D slice sampler [A.31]
that partially alleviates this difficulty by introducing multiple slices.

This general slice sampler can be traced back to the auxiliary variable
algorithm of Edwards and Sokal (1988) applied to the Ising model of Example
5.8 (see also Swendson and Wang 1987, Wakefield et al. 1991, Besag and Green
1993, Damien and Walker 1996, Higdon 1996, Neal 1997, Damien et al. 1999,
and Tierney and Mira 1999). It relies upon the decomposition of the density
$f(x)$ as

$$f(x) \propto \prod_{i=1}^k f_i(x),$$

where the $f_i$'s are positive functions, but not necessarily densities. For
instance, in a Bayesian framework with a flat prior, the $f_i(x)$ may be chosen as
the individual likelihoods.

This decomposition can then be associated with $k$ auxiliary variables
rather than one as in the fundamental theorem, in the sense that each $f_i(x)$
can be written as an integral

$$f_i(x) = \int \mathbb{I}_{0 \le \omega_i \le f_i(x)}\, d\omega_i,$$

and that $f$ is the marginal distribution of the joint distribution

$$(x, \omega_1, \ldots, \omega_k) \sim p(x, \omega_1, \ldots, \omega_k) \propto \prod_{i=1}^k \mathbb{I}_{0 \le \omega_i \le f_i(x)}. \tag{8.3}$$

This particular demarginalization of $f$ (Section 5.3.1) introduces a larger
dimensionality to the problem and induces a generalization of the random walk
of Section 8.1 which is to have uniform proposals one direction at a time. The
corresponding generalization of [A.31] is thus as follows.

Algorithm A.32 -Slice Sampler-

At iteration $t+1$, simulate

1. $\omega_1^{(t+1)} \sim \mathcal{U}_{[0, f_1(x^{(t)})]}$;

   ...                                                                    [A.32]

k. $\omega_k^{(t+1)} \sim \mathcal{U}_{[0, f_k(x^{(t)})]}$;

k+1. $x^{(t+1)} \sim \mathcal{U}_{A^{(t+1)}}$, with
   $A^{(t+1)} = \{y;\ f_i(y) \ge \omega_i^{(t+1)},\ i = 1, \ldots, k\}$.

Example 8.3. A 3D slice sampler. Consider the density proportional to

$$(1 + \sin^2(3x))\, (1 + \cos^4(5x))\, \exp\{-x^2/2\}.$$

The corresponding functions are, for instance, $f_1(x) = (1 + \sin^2(3x))$,
$f_2(x) = (1 + \cos^4(5x))$, and $f_3(x) = \exp\{-x^2/2\}$. In an iteration of the
slice sampler, three uniform $\mathcal{U}([0,1])$ variables $u_1, u_2, u_3$ are
generated and the new value of $x$ is uniformly distributed over the set

$$\left\{x : |x| \le \sqrt{-2 \log \omega_3}\right\} \cap \left\{x : \sin^2(3x) \ge \omega_1 - 1\right\} \cap \left\{x : \cos^4(5x) \ge \omega_2 - 1\right\},$$

where $\omega_i = u_i f_i(x)$ ($i = 1, 2, 3$). This set is made of one or several
intervals depending on whether $\omega_1 = u_1 f_1(x)$ and $\omega_2 = u_2 f_2(x)$
are larger than 1. Figure 8.4 shows how the sample produced by the algorithm fits
the target density. ||
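
The general slice sampler for a product of three factors can be sketched as follows in Python (not from the original text). The factors used here, $1 + \sin^2(3x)$, $1 + \cos^4(5x)$ and $\exp\{-x^2/2\}$, are assumptions of this sketch, since the exponents are not legible in the source. Rather than computing the intervals of the intersection explicitly, the sketch draws uniformly on the interval given by the third factor and rejects points that violate the other two constraints, which also yields a uniform draw on the intersection. All names are illustrative.

```python
import math
import random


def f1(x):
    return 1.0 + math.sin(3.0 * x) ** 2


def f2(x):
    return 1.0 + math.cos(5.0 * x) ** 4


def f3(x):
    return math.exp(-x * x / 2.0)


def slice_sampler_3d(n_iter, x0=0.0, seed=0):
    """Slice sampler with three auxiliary variables, one per factor."""
    rng = random.Random(seed)
    x = x0
    sample = []
    for _ in range(n_iter):
        # omega_i ~ U(0, f_i(x)]; 1 - random() lies in (0, 1]
        w1 = (1.0 - rng.random()) * f1(x)
        w2 = (1.0 - rng.random()) * f2(x)
        w3 = (1.0 - rng.random()) * f3(x)
        # The slice of f3 is the interval [-bound, bound].
        bound = math.sqrt(-2.0 * math.log(w3))
        # Uniform draw on the intersection of the three slices by rejection.
        while True:
            y = rng.uniform(-bound, bound)
            if f1(y) >= w1 and f2(y) >= w2:
                break
        x = y
        sample.append(x)
    return sample


sample = slice_sampler_3d(20000)
first_moment = sum(sample) / len(sample)
second_moment = sum(v * v for v in sample) / len(sample)
print(first_moment, second_moment)
```

The rejection step trades a little extra simulation for not having to solve the trigonometric inequalities; solving them exactly would avoid the inner loop.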




Fig. 8.4. Histogram of a sample produced by 5,000 iterations of a 3D slice
sampler and superposition of the target density proportional to
$(1 + \sin^2(3x))\, (1 + \cos^4(5x))\, \exp\{-x^2/2\}$.



In Section 5.3.1, the completion was justified by a latent variable repre- 
sentation of the model. Here, the auxiliary variables are more artificial, but 
there often is an obvious connection in (natural) latent variable models. 

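Example 8.4. Censored data models. As noted in Section 5.3.1, censored
data models (Example 5.14) can be associated with a missing data structure.
Consider

$$y^* = (y_1^*, y_2^*) = (y \wedge r,\ \mathbb{I}_{y < r}), \quad \text{with } y \sim f(y|\theta) \text{ and } r \sim h(r),$$

so the observation $y$ is censored by the random variable $r$. If we observe
$(y_1^*, \ldots, y_n^*)$, the density of $y_i^* = (y_{1i}^*, y_{2i}^*)$ is then

$$\int_{y_{1i}^*}^{+\infty} f(y|\theta)\, dy\ h(y_{1i}^*)\, \mathbb{I}_{y_{2i}^* = 0} + \int_{y_{1i}^*}^{+\infty} h(r)\, dr\ f(y_{1i}^*|\theta)\, \mathbb{I}_{y_{2i}^* = 1}. \tag{8.4}$$

In the cases where (8.4) cannot be explicitly integrated, the likelihood and
posterior distribution associated with this model may be too complex to be
used. If $\theta$ has the prior distribution $\pi$, the posterior distribution
satisfies

$$\begin{aligned}
\pi(\theta|y_1^*, \ldots, y_n^*) &\propto \pi(\theta) \prod_{\{i:\, y_{2i}^* = 0\}} \left\{h(y_{1i}^*) \int_{y_{1i}^*}^{+\infty} f(y|\theta)\, dy\right\} \prod_{\{i:\, y_{2i}^* = 1\}} \left\{f(y_{1i}^*|\theta) \int_{y_{1i}^*}^{+\infty} h(r)\, dr\right\} \\
&\propto \pi(\theta) \prod_{\{i:\, y_{2i}^* = 0\}} \int_{y_{1i}^*}^{+\infty} f(y|\theta)\, dy \prod_{\{i:\, y_{2i}^* = 1\}} f(y_{1i}^*|\theta).
\end{aligned}$$

If the uncensored model leads to explicit computations for the posterior
distribution of $\theta$, we will see in the next chapter that a logical completion
is to reconstruct the original data, $y_1, \ldots, y_n$, conditionally on the
observed $y_i^*$'s ($i = 1, \ldots, n$), and then implement the Gibbs sampler on the
two groups $\theta$ and the unobserved $y$'s. At this level, we can push the
demarginalization further to first represent the above posterior as the marginal of

$$\pi(\theta) \prod_{\{i:\, y_{2i}^* = 0\}} f(y_i|\theta)\, \mathbb{I}_{y_i > y_{1i}^*} \prod_{\{i:\, y_{2i}^* = 1\}} f(y_{1i}^*|\theta)$$

and write this distribution itself as the marginal of

$$\mathbb{I}_{0 \le \omega_0 \le \pi(\theta)} \prod_{\{i:\, y_{2i}^* = 0\}} \left\{\mathbb{I}_{y_i > y_{1i}^*}\, \mathbb{I}_{0 \le \omega_i \le f(y_i|\theta)}\right\} \prod_{\{i:\, y_{2i}^* = 1\}} \mathbb{I}_{0 \le \omega_i \le f(y_{1i}^*|\theta)},$$

adding to the unobserved $y_i$'s the basic auxiliary variables $\omega_i$
($0 \le i \le n$).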



Although the representation (8.3) of $f$ takes us farther away from the fundamental theorem of simulation, the validation of this algorithm is the same as in Section 8.1: each of the $k+1$ steps in [A.32] preserves the distribution (8.3). This will also be the basis of the Gibbs sampler in Chapters 9 and 10.

While the basic appeal in using this generalization is that the set $A^{(t+1)}$ may be easier to compute as the intersection of the slices

$$A_i^{(t+1)} = \{y;\ f_i(y) \ge \omega_i^{(t+1)}\},$$

there may still be implementation problems. As $k$ increases, the determination of the set $A^{(t+1)}$ usually gets increasingly complex. Neal (2003) develops proposals in parallel to improve the slice sampler in this multidimensional setting, using "over-relaxed" and "reflective slice sampling" procedures. But these proposals are too specialized to be discussed here and, also, they require careful calibration and cannot be expected to automatically handle all settings. Note also that, in missing variable models, the number of auxiliary variables (that is, of slices) increases with the number of observations and may create deadlocks for large data sizes.



8.3 Convergence Properties of the Slice Sampler 



Although, in the following chapters, we will discuss in detail the convergence 
of the Gibbs sampler, of which the slice sampler is a special case, we present 
in this section some preliminary results. These are mostly due to Roberts and 
Rosenthal (1998) and show that the convergence rate of the slice sampler
can be evaluated quantitatively. 

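First, note that, for the 2D slice sampler [A.31], if we denote by $\mu(\omega)$ the Lebesgue measure of the set

$$\mathcal{S}(\omega) = \{y;\ f_1(y) \ge \omega\},$$

the transition kernel is such that

$$\begin{aligned}
\Pr\left(f_1(X^{(t+1)}) \le \eta \,\middle|\, f_1(X^{(t)}) = v\right) &= \int\!\!\int \mathbb{I}_{w \le f_1(x) \le \eta}\ \frac{1}{v}\,\mathbb{I}_{0 \le w \le v}\ \frac{dw\,dx}{\mu(w)} \\
&= \frac{1}{v} \int_0^{\eta \wedge v} \frac{\mu(w) - \mu(\eta)}{\mu(w)}\,dw \\
&= \frac{1}{v} \int_0^{v} \max\left(1 - \frac{\mu(\eta)}{\mu(w)},\ 0\right) dw.
\end{aligned}$$

The properties of the chain are thus entirely characterized by the measure $\mu$ and, moreover, they are equivalent for $(X^{(t+1)})$ and $(f_1(X^{(t+1)}))$, which also is a Markov chain in this case (Problems 8.9 and 8.10).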

Under boundedness conditions, Tierney and Mira (1999) established the 
following uniform ergodicity result. 



Lemma 8.5. If $f_1$ is bounded and $\text{supp}\,f_1$ is bounded, the slice sampler [A.31] is uniformly ergodic.

Proof. Without loss of generality, assume that $f_1$ is bounded by 1 and that the support $\text{supp}\,f_1$ is equal to $(0,1)$. To establish uniform ergodicity, it is necessary and sufficient to prove that Doeblin's condition (Theorem 6.59) holds, namely that the whole space $\text{supp}\,f_1$ is a small set. This follows from the fact that

$$\xi(v) = \Pr\left(f_1(X^{(t+1)}) \le \eta \,\middle|\, f_1(X^{(t)}) = v\right)$$

is decreasing in $v$ for all $\eta$. In fact, when $v \ge \eta$,

$$\xi(v) = \frac{1}{v} \int_0^{\eta} \left(1 - \frac{\mu(\eta)}{\mu(w)}\right) dw$$

is clearly decreasing in $v$. When $v \le \eta$, $\xi(v)$ is the average of the function

$$1 - \frac{\mu(\eta)}{\mu(W)}$$

when $W$ is uniformly distributed over $(0,v)$. Since $\mu(\omega)$ is decreasing in $\omega$, this average is equally decreasing in $v$. The maximum of $\xi(v)$ is thus

$$\lim_{v \to 0} \xi(v) = \lim_{v \to 0} \left(1 - \frac{\mu(\eta)}{\mu(v)}\right) = 1 - \mu(\eta)$$

by L'Hospital's rule, while the minimum of $\xi(v)$ is

$$\lim_{v \to 1} \xi(v) = \int_0^{\eta} \left(1 - \frac{\mu(\eta)}{\mu(w)}\right) dw.$$

Since we are able to find nondegenerate upper and lower bounds on the cdf of the transition kernel, we can derive Doeblin's condition and thus uniform ergodicity. □



In a more general setting, Roberts and Rosenthal (1998) exhibit a small 
set associated with the 2D slice sampler. 

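Lemma 8.6. For any $\epsilon^* > \epsilon_*$, $\{x;\ \epsilon_* < f_1(x) < \epsilon^*\}$ is a small set, that is,

$$\Pr(x, \cdot) \ge \frac{\epsilon_*}{\epsilon^*}\,\nu(\cdot) \quad \text{where} \quad \nu(A) = \frac{1}{\epsilon_*} \int_0^{\epsilon_*} \frac{\lambda(A \cap \{y;\ f_1(y) > \epsilon\})}{\mu(\epsilon)}\,d\epsilon,$$

and $\lambda$ denotes Lebesgue measure.

Proof. For $\epsilon_* < f_1(x) < \epsilon^*$,

$$\begin{aligned}
\Pr(x, A) &= \frac{1}{f_1(x)} \int_0^{f_1(x)} \frac{\lambda(A \cap \{y;\ f_1(y) > \epsilon\})}{\mu(\epsilon)}\,d\epsilon \\
&\ge \frac{1}{\epsilon^*} \int_0^{\epsilon_*} \frac{\lambda(A \cap \{y;\ f_1(y) > \epsilon\})}{\mu(\epsilon)}\,d\epsilon.
\end{aligned}$$

We thus recover the minorizing measure given in the lemma. □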



If $f_1$ is bounded, $\epsilon^*$ can be chosen as the maximum of $f_1$ and the set $\mathcal{S}(\epsilon_*) = \{y;\ f_1(y) \ge \epsilon_*\}$ is a small set. Roberts and Rosenthal (1998) also derive a drift condition (Note 6.9.1), as follows.



Lemma 8.7. Assume

(i) $f_1$ is bounded by 1,

(ii) the function $\mu$ is differentiable,

(iii) $\mu'(y)\,y^{1+1/\alpha}$ is non-increasing for $y < \epsilon^*$ and an $\alpha > 1$.

Then, for $0 < \beta < \min(\alpha - 1, 1)/\alpha$,

$$V(x) = f_1(x)^{-\beta}$$

is a drift function, that is, it satisfies the drift condition (6.42) on the sets $\mathcal{S}(\epsilon_*)$.
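Proof. Recall that $\mathcal{S}(\epsilon_*) = \{y;\ f_1(y) \ge \epsilon_*\}$. If $f_1(x) \le \epsilon^*$, the image of $V$ by the slice sampler transition kernel satisfies

$$\begin{aligned}
KV(x) &= \frac{1}{f_1(x)} \int_0^{f_1(x)} \frac{1}{\mu(\omega)} \int_{\mathcal{S}(\omega)} f_1(y)^{-\beta}\,dy\,d\omega \\
&= \frac{1}{f_1(x)} \int_0^{f_1(x)} \frac{1}{\mu(\omega)} \int_{\omega}^{\infty} z^{-\beta}\,(-\mu'(z))\,dz\,d\omega \\
&\le \frac{1}{f_1(x)} \int_0^{f_1(x)} \frac{\int_{\omega}^{\epsilon^*} z^{-\beta}\,(-\mu'(z))\,dz}{\int_{\omega}^{\epsilon^*} (-\mu'(z))\,dz}\,d\omega \\
&\le \frac{1}{f_1(x)} \int_0^{f_1(x)} \frac{\int_{\omega}^{\epsilon^*} z^{-(1+\beta+1/\alpha)}\,dz}{\int_{\omega}^{\epsilon^*} z^{-(1+1/\alpha)}\,dz}\,d\omega \\
&= \frac{1}{1+\alpha\beta}\,\frac{1}{f_1(x)} \int_0^{f_1(x)} \frac{\omega^{-\beta-1/\alpha} - (\epsilon^*)^{-\beta-1/\alpha}}{\omega^{-1/\alpha} - (\epsilon^*)^{-1/\alpha}}\,d\omega \\
&\le \frac{V(x)}{(1-\beta)(1+\alpha\beta)} + \frac{\alpha\beta\,(\epsilon^*)^{-\beta}}{1+\alpha\beta}
\end{aligned}$$

(see Problem 8.12 for details).

When $f_1(x) \ge \epsilon^*$, the image $KV(x)$ is non-increasing with $f_1(x)$ and, therefore,

$$KV(x) \le KV(x_0) \le (\epsilon^*)^{-\beta}\,\frac{1 + \alpha\beta(1-\beta)}{(1+\alpha\beta)(1-\beta)}$$

for $x_0$ such that $f_1(x_0) = \epsilon^*$.

Therefore, outside $\mathcal{S}(\epsilon_*)$, where $V(x) > \epsilon_*^{-\beta}$,

$$KV(x) \le \left( \frac{1}{(1-\beta)(1+\alpha\beta)} + \frac{\alpha\beta\,(\epsilon_*/\epsilon^*)^{\beta}}{1+\alpha\beta} \right) V(x) = \lambda V(x)$$

for any $\epsilon_* < \epsilon^*$ and, on $\mathcal{S}(\epsilon_*)$,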



KV{x) < 6 = -e 



l + g/3(l-/3) 
(1 + g/3)(l -/?) 



A, 



which establishes the result. 



□

Therefore, under the conditions of Lemma 8.7, the Markov chain associated
with the slice sampler is geometrically ergodic, as follows from Theorem 6.75. 
In addition, Roberts and Rosenthal (1998) derive explicit bounds on the total 
variation distance, as in the following example, the proof of which is beyond 
the scope of this book. 

Example 8.8. Exponential $\mathcal{E}xp(1)$ distribution. If $f_1(x) = \exp(-x)$, Roberts and Rosenthal (1998) show that, for $n \ge 23$,

$$\| K^n(x, \cdot) - f(\cdot) \|_{TV} \le .054865\ (0.985015)^n\ (n - 15.7043). \qquad (8.5)$$

This implies, for instance, that, when $n = 530$, the total variation distance between $K^{530}(x, \cdot)$ and $f$ is less than 0.0095. While this figure is certainly over-conservative, given the smoothness of $f_1$, it is nonetheless a very reasonable bound on the convergence time. ||
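The arithmetic behind the figure quoted for $n = 530$, together with the slice sampler it refers to, can be checked with a short sketch. The function names `tv_bound` and `exp_slice_sampler` are hypothetical, introduced only for this illustration; the sampler uses the fact that, for $f_1(x) = e^{-x}$ on $x > 0$, the slice $\{y;\ f_1(y) \ge \omega\}$ is the interval $[0, -\log \omega]$.

```python
import math
import numpy as np

def tv_bound(n):
    """Right-hand side of (8.5), stated in the text for n >= 23."""
    if n < 23:
        raise ValueError("the bound (8.5) is stated for n >= 23")
    return 0.054865 * 0.985015 ** n * (n - 15.7043)

def exp_slice_sampler(n_iter, x0=1.0, seed=0):
    """2D slice sampler for f1(x) = exp(-x) on x > 0."""
    rng = np.random.default_rng(seed)
    x = x0
    out = np.empty(n_iter)
    for t in range(n_iter):
        # omega | x is uniform on (0, f1(x)]
        omega = (1.0 - rng.random()) * math.exp(-x)
        # x | omega is uniform on the slice [0, -log(omega)]
        x = rng.uniform(0.0, -math.log(omega))
        out[t] = x
    return out

print(tv_bound(530))                      # slightly below 0.0095
print(exp_slice_sampler(20000).mean())    # should be close to 1, the Exp(1) mean
```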

Roberts and Rosenthal (1998) actually show a much more general result: For any density such that $\epsilon\,\mu'(\epsilon)$ is non-increasing, (8.5) holds for all $x$'s such that $f(x) \ge .0025\,\sup f(x)$. In the more general case of the (product) slice sampler [A.32], Roberts and Rosenthal (1998) also establish geometric ergodicity under stronger conditions on the functions $f_i$.

While unidimensional log-concave densities satisfy the condition on $\mu$ (Problem 8.15), Roberts and Rosenthal (2001) explain why multidimensional slice samplers may perform very poorly through the following example.

Example 8.9. A poor slice sampler. Consider the density $f(x) \propto \exp\{-\|x\|\}$ in $\mathbb{R}^d$. Simulating from this distribution is equivalent to simulating the radius $z = \|x\|$ from

$$\eta_d(z) \propto z^{d-1}\,e^{-z}, \quad z > 0,$$

or, by a change of variable $u = z^d$, from

$$\pi_d(u) \propto e^{-u^{1/d}}, \quad u > 0.$$

If we run the slice sampler associated with $\pi_d$, the performances degenerate as $d$ increases. This sharp decrease in the performances is illustrated by Figure 8.5: for $d = 1, 5$, the chain mixes well and the autocorrelation function decreases fast enough. This is not the case for $d = 10$ and even less for $d = 50$, where mixing is slowed down so much that convergence does not seem possible! ||
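The degradation with $d$ can be reproduced numerically with a minimal sketch, under the assumption that the target is $\pi_d(u) \propto \exp(-u^{1/d})$. Since the slice $\{u;\ \pi_d(u) \ge \omega\}$ is $[0, (-\log\omega)^d]$ and $-\log\omega = u^{1/d} + E$ with $E$ exponential, one step of the slice sampler reads $z' = U^{1/d}(z + E)$ on the radius scale $z = u^{1/d}$; the code works on that scale to avoid large powers. The names `radial_slice_chain` and `lag1_autocorr` are illustrative only.

```python
import numpy as np

def radial_slice_chain(d, n_iter=20000, seed=0):
    """Slice sampler for pi_d(u) ~ exp(-u^(1/d)), recorded as z = u^(1/d)."""
    rng = np.random.default_rng(seed)
    z = float(d)                      # start at the mean of the Gamma(d, 1) radius
    out = np.empty(n_iter)
    for t in range(n_iter):
        e = rng.exponential(1.0)      # -log(omega) = z + e for omega ~ U(0, exp(-z))
        u = 1.0 - rng.random()        # uniform on (0, 1]
        z = u ** (1.0 / d) * (z + e)  # new u uniform on [0, (z + e)^d], back to z scale
        out[t] = z
    return out

def lag1_autocorr(x):
    """Empirical lag-1 autocorrelation of a series."""
    xc = x - x.mean()
    return float(np.dot(xc[:-1], xc[1:]) / np.dot(xc, xc))

for d in (1, 5, 10, 50):
    print(d, round(lag1_autocorr(radial_slice_chain(d)), 3))
```

Under this representation the lag-1 autocorrelation of the radius is $d/(d+1)$ at stationarity, which is one way to see why mixing slows down as $d$ grows.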

Roberts and Rosenthal (2001) take advantage of this example to propose an alternative to the slice sampler called the polar slice sampler, which relates more to the general Gibbs sampler of the next chapter (see Problem 9.21).

Fig. 8.5. Raw plots and autocorrelation functions for the series generated by the slice samplers associated with $f_1(z) = z^{d-1}\,e^{-z}$ for $d = 1, 5, 10, 50$.



8.4 Problems 

8.1 Referring to Algorithm [A. 31] and equations (8.1) and (8.2): 

(a) Show that the stationary distribution of the Markov chain [A. 31] is the 
uniform distribution on the set {(x,u) : 0 < u < f{x)}. 

(b) Show that the conclusion of part (a) remains the same if we use $f_1$ in [A.31], where $f(x) = C f_1(x)$.

8.2 In the setup of Example 8.1, show that the cdf associated with the density $\exp(-\sqrt{x})$ on $\mathbb{R}^+$ can be computed in closed form. (Hint: Make the change of variable $z = \sqrt{x}$ and do an integration by parts on $z \exp(-z)$.)

8.3 Consider two unnormalized versions of a density $f$, $f_1$ and $f_2$. By implementing the slice sampling algorithm [A.31] on both $f_1$ and $f_2$, show that the chains $(x^{(t)})_t$ produced by both versions are exactly the same if they both start from the same value and use the same uniform variates $u \sim \mathcal{U}([0,1])$.

8.4 As a generalization of the density in Example 8.1, consider the density $f(x) \propto \exp\{-x^d\}$, for $d < 1$. Write down a slice sampler algorithm for this density, and evaluate its performance for $d = .1, .25, .4$.

8.5 Show that a possible slice sampler associated with the standard normal density, $f(x) \propto \exp(-x^2/2)$, is associated with the two conditional distributions

$$\omega | x \sim \mathcal{U}\left[0, \exp(-x^2/2)\right], \qquad X | \omega \sim \mathcal{U}\left(\left[-\sqrt{-2\log(\omega)},\ \sqrt{-2\log(\omega)}\right]\right). \qquad (8.6)$$

Compare the performances of this slice sampler with those of an iid sampler from $\mathcal{N}(0,1)$ by computing the empirical cdf at 0, .67, .84, 1.28, 1.64, 1.96, 2.33, 2.58, 3.09, and 3.72 for two samples of same size produced under both approaches. (Those figures correspond to the .5, .75, .8, .9, .95, .975, .99, .995, .999 and .9999 quantiles, respectively.)

8.6 Reproduce the comparison of Problem 8.5 in the case of (a) the gamma distribution and (b) the Poisson distribution.

8.7 Consider the mixture distribution (1.3), where the $p_j$'s are unknown and the $f_j$'s are known densities. Given a sample $x_1, \ldots, x_n$, examine whether or not a slice sampler can be constructed for the associated posterior distribution.

8.8 (Neal 2003) Consider the following hierarchical model ($i = 1, \ldots, 10$)

$$x_i | v \sim \mathcal{N}(0, e^{v}), \qquad v \sim \mathcal{N}(0, 3).$$

(a) When simulating this distribution from an independent Metropolis-Hastings algorithm with $\mathcal{N}(0,1)$ proposals on all variables, show that the algorithm fails to recover the proper distribution for $v$ by looking at the smaller values of $v$.

(b) Explain why this poor behavior occurs by considering the acceptance probability when $v$ is smaller than $-5$.

(c) Evaluate the performance of the corresponding random walk Metropolis-Hastings algorithm which updates one component at a time with a Gaussian proposal centered at the current value and variance equal to one. Explain why the problem now occurs for the larger values of $v$.

(d) Compare with a single-variable slice sampling, that is, slice sampling applied to the eleven full conditional distributions of $v$ given the $x_i$'s and of the $x_i$'s given the $x_j$'s ($j \ne i$) and $v$.

8.9 (Roberts and Rosenthal 1998) Show that, for the slice sampler [A.31], $(f_1(X^{(t)}))$ is a Markov chain with the same convergence properties as the original chain $(X^{(t)})$.

8.10 (Roberts and Rosenthal 1998) Using the notation of Section 8.3, show that, if $f_1$ and $\tilde{f}_1$ are two densities on two spaces $\mathcal{X}$ and $\tilde{\mathcal{X}}$ such that $\mu(w) = \tilde{\mu}(aw)$ for all $w$'s, the kernels of the chains $(f_1(X^{(t)}))$ and $(\tilde{f}_1(\tilde{X}^{(t)}))$ are the same, even when the dimensions of $\mathcal{X}$ and $\tilde{\mathcal{X}}$ are different.

8.11 (Roberts and Rosenthal 1998) Consider the distribution on $\mathbb{R}^d$ with density proportional to

$$f_0(x)\ \mathbb{I}_{\{y;\ f_1(y) \ge \omega\}}(x).$$

Let $T$ be a differentiable injective transformation, with Jacobian $J$.

(a) Show that sampling x from this density is equivalent to sampling z = T{x) 
from 

and deduce that Pf„{x,A) = P^t{T{x),T{A)). 

(b) Show that the transformation $T(x) = (T_1(x), x_2, \ldots, x_d)$, where $T_1(x) = \int_0^{x_1} f_0(t, x_2, \ldots, x_d)\,dt$, is such that $f_0(T^{-1}(z))/J(T^{-1}(z))$ is constant over the range of $T$.

8.12 (Roberts and Rosenthal 1998) In the proof of Lemma 8.7, when $f_1(x) \le \epsilon^*$:

(a) Show that the first equality follows from the definition of $V$.

(b) Show that the second equality follows from

$$\int_{\mathcal{S}(\omega)} f_1(y)^{-\beta}\,dy = \int_{\omega}^{\infty} z^{-\beta}\,(-\mu'(z))\,dz.$$

(c) Show that the first inequality follows from the fact that $z^{-\beta}$ is decreasing and the ratio can be expressed as an expectation.

(d) Show that, if $g_1$ and $g_2$ are two densities such that $g_1(x)/g_2(x)$ is increasing, and if $h$ is also increasing, then $\int h(x) g_1(x)\,dx \ge \int h(x) g_2(x)\,dx$. Deduce the second inequality from this general result.

(e) Establish the third equality.

(f) Show that

is an increasing function of $z \le \epsilon^*$ when $\beta\alpha < 1$. Derive the last inequality.

8.13 (Roberts and Rosenthal 1998) Show that $\mu'(y)\,y^{1+1/\alpha}$ is non-increasing in the following two cases:

(a) $\mathcal{X} = \mathbb{R}^+$ and $f_1(x) \propto \exp(-\gamma x)$ for $x$ large enough;

(b) X = M"*" and /i(x) oc x~~^ for x large enough. 

8.14 (Roberts and Rosenthal 1998) In Lemma 8.7, show that, if e* = 1, then the 

bound h on KV{x) when x £ y{e^) simplifies into b = • 

8.15 (Roberts and Rosenthal 1998) Show that, if the density $f$ on $\mathbb{R}$ is log-concave, then $\omega\,\mu'(\omega)$ is non-increasing. (Hint: Show that, if $\mu^{-1}(\omega)$ is log-concave, then $\omega\,\mu'(\omega)$ is non-increasing.)

8.16 (Roberts and Rosenthal 1998) Show that, if the density / in is such that 

is non-increasing for all ^’s and D(y; 6) — sup {t > 0; f(t9) < y} , then u:/j,{u:y 
is non-increasing. 

8.17 Examine whether or not the density on $\mathbb{R}^+$ defined as $f(u) \propto \exp(-u^{1/d})$ is log-concave. Show that $f(x) \propto \exp(-\|x\|)$ is log-concave in any dimension $d$.

8.18 In the general strategy proposed by Neal (2003) and discussed in Note 8.5.1, the interval $I_t = (L_t, R_t)$ can be shrunk during a rejection sampling scheme as follows: if $\xi$ is rejected as not belonging to the slice, change $L_t$ to $\xi$ if $\xi < x^{(t)}$ and $R_t$ to $\xi$ if $\xi > x^{(t)}$. Show that this scheme produces an acceptable interval $I_t$ at the end of the rejection scheme.

8.19 In the stepping-out procedure described in Note 8.5.1, show that choosing $\omega$ too small may result in a non-irreducible Markov chain when the slice is not connected. Show that the doubling procedure avoids this difficulty due to the random choice of the side of doubling.

8.20 Again referring to Note 8.5.1, when using the same scale $\omega$, compare the
expansion rate of the intervals in the stepping-out and the doubling procedures, 
and show that the doubling procedure is faster. 

8.5 Notes 

8.5.1 Dealing with Difficult Slices 

After introducing the term "slice sampling" in Neal (1997), Neal (2003) proposed improvements to the standard slice sampling algorithm [A.31]. In particular, given the frequent difficulty in generating exactly a uniform on the slice $A^{(t)}$, he suggests to implement this slice sampler in a univariate setting by replacing the "slice" $A^{(t)}$ with an interval $I_t = (L_t, R_t)$ that contains most of the slice. He imposes the condition on $I_t$ that the set $\mathcal{A}_t$ of $x$'s in $A^{(t)} \cap I_t$ where the probability of constructing $I_t$ starting from $x$ is the same as the probability of constructing $I_t$ starting from the current value can be constructed easily. This property is essential to ensure that the stationary distribution is the uniform distribution on the slice $A^{(t)}$. (We refer the reader to Neal 2003 for more details about the theoretical validity of his method, but warn him/her that utmost caution must be exercised in the construction of $\mathcal{A}_t$ for detailed balance to hold!) As noted in Neal (2003) and Mira and Roberts (2003), irreducibility of the procedure may be a problem (Problem 8.19).

Besides the obvious choices when $I_t$ is $A^{(t)}$ and when it is the entire range of the (bounded) support of $f$, an original solution of Neal (2003), called the "stepping-out" procedure, is to create an interval of length $\omega$ containing the current value and to expand this interval in steps of size $\omega$ till both ends are outside the slice $A^{(t)}$. Similarly, his "doubling procedure" consists in the same random starting interval of length $\omega$ whose length is doubled (leftwise or rightwise at random) recursively till both ends are outside the slice. While fairly general, this scheme also depends on a scale parameter $\omega$ that may be too large for some values of $x^{(t)}$ and too small for others.

Once an acceptable interval $I_t$ has been produced, a point $\xi$ is drawn at random on $I_t$: if it belongs to $\mathcal{A}_t$, it is accepted. Otherwise, repeated sampling must be used, with the possible improvement on $I_t$ gathered from previous failed attempts (see Problem 8.18).
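The stepping-out procedure combined with the shrinkage of Problem 8.18 can be sketched as below for a univariate target. This is a simplified illustration rather than Neal's full algorithm: there is no cap on the number of expansion steps, the target is supplied through a hypothetical `log_f` argument (log of the unnormalized density), and the scale written $\omega$ in the text is called `w` in the code to avoid confusion with the slice height.

```python
import math
import numpy as np

def slice_step(x0, log_f, w, rng):
    """One slice sampling update with stepping-out and shrinkage."""
    # log of the slice height: uniform on (0, f(x0)], taken on the log scale
    log_y = log_f(x0) + math.log(1.0 - rng.random())
    # random initial interval of length w containing x0
    left = x0 - w * rng.random()
    right = left + w
    # stepping out: expand in steps of size w until both ends leave the slice
    while log_f(left) > log_y:
        left -= w
    while log_f(right) > log_y:
        right += w
    # shrinkage: draw in the interval, shrink towards x0 after each failure
    while True:
        x1 = rng.uniform(left, right)
        if log_f(x1) >= log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

# usage sketch on a standard normal target
rng = np.random.default_rng(0)
x, draws = 0.0, []
for _ in range(5000):
    x = slice_step(x, lambda v: -0.5 * v * v, 1.0, rng)
    draws.append(x)
```

Working with `log_f` rather than the density itself is a numerical convenience only; the accepted point is still uniform on the part of the slice inside the final interval.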
The Two-Stage Gibbs Sampler



All that mattered was that slowly, by degrees, by left and right then left 
and right again, he was guiding them towards the destination. 

— Ian Rankin, Hide and Seek 

The previous chapter presented the slice sampler, a special case of a Markov 
chain algorithm that did not need an Accept-Reject step to be valid, seemingly 
because of the uniformity of the target distribution. The reason why the slice 
sampler works is, however, unrelated to this uniformity and we will see in this 
chapter a much more general family of algorithms that function on the same 
principle. This principle is that of using the true conditional distributions 
associated with the target distribution to generate from that distribution. 

In order to facilitate the link with Chapter 8, we focus in this chapter on 
the two-stage Gibbs sampler before covering in Chapter 10 the more general 
case of the multi-stage Gibbs sampler. There are several reasons for this di- 
chotomy. First, the two-stage Gibbs sampler can be derived as a generalization 
of the slice sampler. The two algorithms thus share superior convergence prop- 
erties that do not apply to the general multistage Gibbs sampler. A second 
reason is that the two-stage Gibbs sampler applies naturally in a wide range of 
statistical models that do not call for the generality of Chapter 10. Lastly, the 
developments surrounding the Gibbs sampler are so numerous that starting 
with a self-contained algorithm like the two-stage Gibbs sampler provides a 
gentler entry to the topic. 



9.1 A General Class of Two-Stage Algorithms 

9.1.1 From Slice Sampling to Gibbs Sampling

Instead of a density $f_X(x)$ as in Chapter 8, consider now a joint density $f(x,y)$ defined on an arbitrary product space, $\mathcal{X} \times \mathcal{Y}$. If we use the fundamental theorem of simulation (Theorem 2.15) in this setup, we simulate a uniform distribution on the set

$$\mathcal{S}(f) = \{(x,y,u) : 0 \le u \le f(x,y)\}.$$

Since we now face a three-component setting, a natural implementation of the random walk principle is to move uniformly in one component at a time. This means that, starting at a point $(x,y,u)$ in $\mathcal{S}(f)$, we generate

(i) $X$ along the $x$-axis from the uniform distribution on $\{x : u \le f(x,y)\}$,

(ii) $Y$ along the $y$-axis from the uniform distribution on $\{y : u \le f(x',y)\}$,

(iii) $U$ along the $u$-axis from the uniform distribution on $[0, f(x',y')]$.

(Note that this is different from the 3D slice sampler of Example 8.3.)

There are two important things to note:

(1) Generating from the uniform distribution on $\{x : u \le f(x,y)\}$ is equivalent to generating from the uniform distribution on

$$\{x : f_{X|Y}(x|y) \ge u/f_Y(y)\},$$

where $f_{X|Y}$ and $f_Y$ denote the conditional and marginal distributions of $X$ given $Y$ and of $Y$, respectively, that is,

$$f_Y(y) = \int f(x,y)\,dx \quad \text{and} \quad f_{X|Y}(x|y) = \frac{f(x,y)}{f_Y(y)}. \qquad (9.1)$$

(2) The sequence of uniform generations along the three axes does not need to be done in the same order $x$-$y$-$u$ all the time for the Markov chain to remain stationary with stationary distribution the uniform on $\mathcal{S}(f)$.

Thus, for example, simulations along the $x$ and the $u$ axes can be repeated several times before moving to the simulation along the $y$ axis. If we put these two remarks together and consider the limiting case where the $X$ and the $U$ simulations are repeated an infinite number of times before moving to the $Y$ simulation, we end up with a simulation (in $\mathcal{X}$) of

$$X \sim f_{X|Y}(x|y),$$

by virtue of the 2D slice sampler. Consider now the same repetition of $Y$ and $U$ simulations with $X$ fixed at its latest value: in the limiting case, this produces a simulation (in $\mathcal{Y}$) of

$$Y \sim f_{Y|X}(y|x).$$

Assuming that both of these conditional distributions can be simulated, we can therefore implement the limiting case of the slice sampler and still maintain the stationarity of the uniform distribution. In addition, the simulation of the $U$'s gets somehow superfluous since we are really interested in the generation from $f(x,y)$, rather than from the uniform on $\mathcal{S}(f)$.

Although this is not the way it was originally derived, this introduction 
points out the strong link between the slice sampler and the two-stage Gibbs 
sampler, sometimes called Data Augmentation (Tanner and Wong 1987). It 
also stresses how the two-stage Gibbs sampler takes further advantage of the knowledge on the distribution $f$, compared with the slice sampler that uses only the numerical values of $f(x)$. This is reflected in the fact that each step of the two-stage Gibbs sampler amounts to an infinity of steps of a special slice sampler. (Note that this does not mean that the two-stage Gibbs sampler always does better than any slice sampler.)

9.1.2 Definition 

The algorithmic implementation of the two-stage Gibbs sampler is thus straightforward. If the random variables $X$ and $Y$ have joint density $f(x,y)$, the two-stage Gibbs sampler generates a Markov chain $(X_t, Y_t)$ according to the following steps:

Algorithm A.33 -Two-stage Gibbs sampler-

Take $X_0 = x_0$

For $t = 1, 2, \ldots$, generate

1. $Y_t \sim f_{Y|X}(\cdot|x_{t-1})$;

2. $X_t \sim f_{X|Y}(\cdot|y_t)$.

[A.33]

where $f_{Y|X}$ and $f_{X|Y}$ are the conditional distributions associated with $f$, as in (9.1).

We note here that not only is the sequence $(X_t, Y_t)$ a Markov chain, but also each subsequence $(X_t)$ and $(Y_t)$ is a Markov chain. For example, the chain $(X_t)$ has transition density

$$K(x, x^*) = \int f_{Y|X}(y|x)\,f_{X|Y}(x^*|y)\,dy,$$

which indeed depends on the past only through the last value of $(X_t)$. (Note the similarity to Eaton's transition (6.46).) In addition, it is also easy to show that $f_X$ is the stationary distribution associated with this (sub)chain, since

$$\begin{aligned}
f_X(x') &= \int f_{X|Y}(x'|y)\,f_Y(y)\,dy \\
&= \int f_{X|Y}(x'|y) \int f_{Y|X}(y|x)\,f_X(x)\,dx\,dy \\
&= \int \left[ \int f_{X|Y}(x'|y)\,f_{Y|X}(y|x)\,dy \right] f_X(x)\,dx \\
&= \int K(x, x')\,f_X(x)\,dx.
\end{aligned} \qquad (9.2)$$

Example 9.1. Normal bivariate Gibbs. For the special case of the bivariate normal density,

$$(X, Y) \sim \mathcal{N}_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \qquad (9.3)$$

the Gibbs sampler is

Given $y_t$, generate

$$\begin{aligned}
X_{t+1} \mid y_t &\sim \mathcal{N}(\rho\,y_t,\ 1 - \rho^2); \\
Y_{t+1} \mid x_{t+1} &\sim \mathcal{N}(\rho\,x_{t+1},\ 1 - \rho^2).
\end{aligned} \qquad (9.4)$$

The Gibbs sampler is obviously not necessary in this particular case, as iid copies of $(X, Y)$ can be easily generated using the Box-Muller algorithm (see Example 2.8). Note that the corresponding marginal Markov chain in $X$ is defined by the AR(1) relation

$$X_{t+1} = \rho^2 X_t + \sigma \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, 1),$$

with $\sigma^2 = 1 - \rho^2 + \rho^2(1 - \rho^2) = 1 - \rho^4$. As shown in Example 6.43, the stationary distribution of this chain is indeed the normal distribution $\mathcal{N}\left(0, \frac{\sigma^2}{1 - \rho^4}\right) = \mathcal{N}(0, 1)$. ||
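The two conditional draws in (9.4) translate directly into code. The following sketch (with a hypothetical function name `bivariate_normal_gibbs`) runs the sampler and can be used to check empirically that the correlation of the pairs is close to $\rho$ and that the lag-1 autocorrelation of the $X$ subchain is close to $\rho^2$, as the AR(1) relation suggests.

```python
import numpy as np

def bivariate_normal_gibbs(rho, n_iter=20000, x0=0.0, y0=0.0, seed=0):
    """Two-stage Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho ** 2)      # conditional standard deviation
    x, y = x0, y0
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        x = rng.normal(rho * y, sd)   # X_{t+1} | y_t
        y = rng.normal(rho * x, sd)   # Y_{t+1} | x_{t+1}
        out[t] = x, y
    return out

sample = bivariate_normal_gibbs(0.6)
print(np.corrcoef(sample[:, 0], sample[:, 1])[0, 1])   # should be near rho = 0.6
```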

This motivation of the two-stage Gibbs sampler started with a joint distribution $f(x,y)$. However, what we saw in Chapter 8 was exactly the opposite justification. There, we started with a marginal density $f_X(x)$ and constructed (or completed) a joint density to aid in simulation where the second variable $Y$ (that is, $U$ in slice sampling) is an auxiliary variable that is not directly relevant from the statistical point of view. This connection will be detailed in Section 10.1.2 for the general Gibbs sampler, but we can point out at this stage that there are many settings where a natural completion of $f_X(x)$ into $f(x,y)$ does exist. One such setting is the domain of missing data models, introduced in Section 5.3.1,

$$f(x|\theta) = \int_{\mathcal{Z}} g(x, z|\theta)\,dz,$$

as, for instance, the mixtures of distributions (Examples 1.2 and 5.19).

Data Augmentation was introduced independently (that is, unrelated to 
Gibbs sampling) by Tanner and Wong (1987), and is, perhaps, more closely 
related to the EM algorithm of Dempster et al. (1977) (Section 5.3.2) and 
methods of stochastic restoration (see Note 5.5.1). It is even more related to 
recent versions of EM such as ECM and MCEM (see Meng and Rubin 1991, 
1992 and Rubin, Liu and Rubin 1994, and Problem 9.11). 

Example 9.2. Gibbs sampling on mixture posterior. Consider a mixture of distributions

$$\sum_{j=1}^{k} p_j\,f(x|\xi_j), \qquad (9.5)$$

where $f(\cdot|\xi)$ belongs to an exponential family

$$f(x|\xi) = h(x) \exp\{\xi \cdot x - \psi(\xi)\}$$

and $\xi$ is distributed from the associated conjugate prior

$$\pi(\xi|\alpha_0, \lambda) \propto \exp\{\lambda(\xi \cdot \alpha_0 - \psi(\xi))\}, \qquad \lambda > 0,\ \alpha_0 \in \mathcal{X},$$

while

$$(p_1, \ldots, p_k) \sim \mathcal{D}_k(\gamma_1, \ldots, \gamma_k).$$

Given a sample $(x_1, \ldots, x_n)$ from (9.5), we can associate with every observation an indicator variable $z_i \in \{1, \ldots, k\}$ that indicates which component of the mixture is associated with $x_i$ (see Problems 5.8-5.10). The demarginalization (or completion) of model (9.5) is then

$$z_i \sim \mathcal{M}_k(1; p_1, \ldots, p_k), \qquad x_i|z_i \sim f(x|\xi_{z_i}).$$

Thus, considering $x_i^* = (x_i, z_i)$ (instead of $x_i$) entirely eliminates the mixture structure since the likelihood of the completed model is

$$\begin{aligned}
\ell(p, \xi|x_1^*, \ldots, x_n^*) &\propto \prod_{i=1}^{n} p_{z_i}\,f(x_i|\xi_{z_i}) \\
&= \prod_{j=1}^{k} \prod_{i;\, z_i=j} p_j\,f(x_i|\xi_j).
\end{aligned}$$

(This latent structure is also exploited in the original implementation of the EM algorithm; see Section 5.3.2.) The two steps of the Gibbs sampler are

Algorithm A.34 -Mixture Posterior Simulation-

1. Simulate $z_i$ ($i = 1, \ldots, n$) from

$$P(z_i = j) \propto p_j\,f(x_i|\xi_j) \qquad (j = 1, \ldots, k)$$

and compute the statistics

$$n_j = \sum_{i=1}^{n} \mathbb{I}_{z_i=j}, \qquad n_j \bar{x}_j = \sum_{i=1}^{n} \mathbb{I}_{z_i=j}\,x_i.$$

2. Generate ($j = 1, \ldots, k$)

$$\xi_j \sim \pi\left(\xi \,\middle|\, \frac{\lambda_j \alpha_j + n_j \bar{x}_j}{\lambda_j + n_j},\ \lambda_j + n_j\right), \qquad p \sim \mathcal{D}_k(\gamma_1 + n_1, \ldots, \gamma_k + n_k).$$

[A.34]

As an illustration, consider the same setting as Example 5.19, namely a normal mixture with two components with equal known variance and fixed weights,

$$p\,\mathcal{N}(\mu_1, \sigma^2) + (1-p)\,\mathcal{N}(\mu_2, \sigma^2). \qquad (9.6)$$

We assume in addition a normal $\mathcal{N}(0, 10\sigma^2)$ prior distribution on both means $\mu_1$ and $\mu_2$. Generating directly from the posterior associated with a sample $x = (x_1, \ldots, x_n)$ from (9.6) quickly turns impossible, as discussed for instance in Diebolt and Robert (1994) and Celeux et al. (2000), because of a combinatoric explosion in the number of calculations, which grow as $O(2^n)$.

As for the EM algorithm (Problems 5.8-5.10), a natural completion of $(\mu_1, \mu_2)$ is to introduce the (unobserved) component indicators $z_i$ of the observations $x_i$, namely,

$$P(Z_i = 1) = 1 - P(Z_i = 2) = p \quad \text{and} \quad X_i|Z_i = k \sim \mathcal{N}(\mu_k, \sigma^2).$$

The completed distribution is thus

$$\pi(\mu_1, \mu_2, z|x) \propto \exp\{-(\mu_1^2 + \mu_2^2)/20\sigma^2\} \prod_{z_i=1} p \exp\{-(x_i - \mu_1)^2/2\sigma^2\} \times \prod_{z_i=2} (1-p) \exp\{-(x_i - \mu_2)^2/2\sigma^2\}.$$

Since $\mu_1$ and $\mu_2$ are independent, given $(z, x)$, the conditional distributions are ($j = 1, 2$)

$$\mathcal{N}\left( \frac{\sum_{z_i=j} x_i}{.1 + n_j},\ \frac{\sigma^2}{.1 + n_j} \right),$$

where $n_j$ denotes the number of $z_i$'s equal to $j$. Similarly, the conditional distribution of $z$ given $(\mu_1, \mu_2)$ is a product of binomials, with

$$P(Z_i = 1|x_i, \mu_1, \mu_2) = \frac{p \exp\{-(x_i - \mu_1)^2/2\sigma^2\}}{p \exp\{-(x_i - \mu_1)^2/2\sigma^2\} + (1-p) \exp\{-(x_i - \mu_2)^2/2\sigma^2\}}.$$

Figure 9.1 illustrates the behavior of the Gibbs sampler in that setting, with a simulated dataset of 500 points from the $.7\,\mathcal{N}(0, 1) + .3\,\mathcal{N}(2.7, 1)$ distribution. The representation of the MCMC sample after 15,000 iterations is quite in agreement with the posterior surface, represented via a grid on the $(\mu_1, \mu_2)$ space and some contours; while it may appear to be too concentrated around one mode, the second mode represented on this graph is much lower since there is a difference of at least 50 in log-posterior values. However, the Gibbs sampler may also fail to converge, as described in Diebolt and Robert (1994) and illustrated in Figure 12.16. When initialized at a local mode of the likelihood, the magnitude of the moves around this mode may be too limited to allow for exploration of further modes (in a reasonable number of iterations). ||
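The two conditional steps for the normal mixture (9.6) can be sketched as below, assuming known $p$ and $\sigma$ and the $\mathcal{N}(0, 10\sigma^2)$ prior on both means, so that the conditional of $\mu_j$ has mean $\sum_{z_i=j} x_i/(n_j + .1)$ and variance $\sigma^2/(n_j + .1)$. The function name `mixture_gibbs`, the starting values and the simulated dataset are illustrative assumptions, not the settings used for Figure 9.1.

```python
import numpy as np

def mixture_gibbs(x, p, sigma, n_iter=2000, seed=0):
    """Two-stage Gibbs sampler for the means of a two-component normal mixture."""
    rng = np.random.default_rng(seed)
    n = x.size
    # illustrative starting values on both sides of the sample mean
    mu = np.array([x.mean() - x.std(), x.mean() + x.std()])
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Step 1: allocations z_i given (mu_1, mu_2), computed on the log scale
        log_w1 = np.log(p) - 0.5 * ((x - mu[0]) / sigma) ** 2
        log_w2 = np.log1p(-p) - 0.5 * ((x - mu[1]) / sigma) ** 2
        prob1 = 1.0 / (1.0 + np.exp(log_w2 - log_w1))
        in_first = rng.random(n) < prob1
        # Step 2: means given the allocations (conjugate normal update)
        for j, mask in enumerate((in_first, ~in_first)):
            prec = mask.sum() + 0.1   # n_j + 1/10
            mu[j] = rng.normal(x[mask].sum() / prec, sigma / np.sqrt(prec))
        draws[t] = mu
    return draws

# simulated data from .7 N(0,1) + .3 N(2.7,1), as in the illustration above
rng = np.random.default_rng(42)
from_first = rng.random(500) < 0.7
data = np.where(from_first, rng.normal(0.0, 1.0, 500), rng.normal(2.7, 1.0, 500))
draws = mixture_gibbs(data, p=0.7, sigma=1.0)
print(draws[500:].mean(axis=0))
```

Starting the chain elsewhere, for instance with the two means swapped, is a simple way to observe the local-mode behavior mentioned at the end of the example.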
Fig. 9.1. Gibbs sample of 5,000 points for the mixture posterior against the posterior surface.



9.1.3 Back to the Slice Sampler 



Just as the two-stage Gibbs sampler can be seen as a limiting case of the slice sampler in a three-coordinate problem, the slice sampler can be interpreted as a special case of two-stage Gibbs sampler when the joint distribution is the uniform distribution on the subgraph $\mathcal{S}(f)$. From this point of view, the slice sampler starts with $f_X(x)$ and creates a joint density $f(x,u) = \mathbb{I}(0 \le u \le f_X(x))$. The associated conditional densities are

$$f_{X|U}(x|u) = \frac{\mathbb{I}(0 \le u \le f_X(x))}{\int \mathbb{I}(0 \le u \le f_X(x))\,dx} \quad \text{and} \quad f_{U|X}(u|x) = \frac{\mathbb{I}(0 \le u \le f_X(x))}{\int \mathbb{I}(0 \le u \le f_X(x))\,du},$$

which are exactly those used in the slice sampler. Therefore, the $X$ sequence is also a Markov chain with transition kernel

$$K(x, x') = \int f_{X|U}(x'|u)\,f_{U|X}(u|x)\,du$$

and stationary density $f_X(x)$.

What the slice sampler tells us is that we can induce a Gibbs sampler for any marginal distribution $f_X(x)$ by creating a joint distribution that is, formally, arbitrary. Starting from $f_X(x)$, we can take any conditional density $g(y|x)$ and create a Gibbs sampler with

$$f_{X|Y}(x|y) = \frac{g(y|x)\,f_X(x)}{\int g(y|x)\,f_X(x)\,dx} \quad \text{and} \quad f_{Y|X}(y|x) = \frac{g(y|x)\,f_X(x)}{\int g(y|x)\,f_X(x)\,dy}.$$



9.1.4 The Hammersley-Clifford Theorem

A most surprising feature of the Gibbs sampler is that the conditional distributions contain sufficient information to produce a sample from the joint distribution. (This is the case for both two-stage and multi-stage Gibbs; see also Section 10.1.3.) By comparison with maximization problems, this approach is akin to maximizing an objective function successively in every direction of a given basis. It is well known that this optimization method does not necessarily lead to the global maximum, but may end up in a saddlepoint.

It is, therefore, somewhat remarkable that the full conditional distribu- 
tions perfectly summarize the joint density, although the set of marginal dis- 
tributions obviously fails to do so. The following result then shows that the 
joint density can be directly and constructively derived from the conditional 
densities. 



Theorem 9.3. The joint distribution associated with the conditional densities $f_{Y|X}(y|x)$ and $f_{X|Y}(x|y)$ has the joint density

$$f(x,y) = \frac{f_{Y|X}(y|x)}{\int \left[ f_{Y|X}(y|x)/f_{X|Y}(x|y) \right] dy}.$$

Proof. Since $f(x,y) = f_{Y|X}(y|x)\,f_X(x) = f_{X|Y}(x|y)\,f_Y(y)$ we have

$$\int \frac{f_{Y|X}(y|x)}{f_{X|Y}(x|y)}\,dy = \int \frac{f_Y(y)}{f_X(x)}\,dy = \frac{1}{f_X(x)}, \qquad (9.7)$$

and the result follows. □



This derivation of $f(x,y)$ obviously requires the existence and computation of the integral (9.7). However, this result clearly demonstrates the fundamental feature that the two conditionals are sufficiently informative to recover the joint density. Note, also, that this theorem makes the implicit assumption that the joint density $f(x,y)$ exists. (See Section 10.4.3 for a discussion of what happens when this assumption is not satisfied.)
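As a numerical illustration of Theorem 9.3, the joint density can be rebuilt from the two conditionals of the bivariate normal of Example 9.1 and compared with the known closed form. The sketch below approximates the integral in (9.7) by a Riemann sum on a grid; the function names, the grid width and the number of grid points are arbitrary choices made for this illustration.

```python
import numpy as np

def log_normal_pdf(v, mean, var):
    """Log-density of a N(mean, var) distribution at v."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (v - mean) ** 2 / var

def joint_from_conditionals(x, y, rho, half_width=12.0, n_grid=24001):
    """Joint density at (x, y) rebuilt from f_{Y|X} and f_{X|Y} via Theorem 9.3."""
    var = 1.0 - rho ** 2
    grid = np.linspace(-half_width, half_width, n_grid)
    # log of f_{Y|X}(y'|x) / f_{X|Y}(x|y') over the grid of y' values
    log_ratio = log_normal_pdf(grid, rho * x, var) - log_normal_pdf(x, rho * grid, var)
    integral = np.sum(np.exp(log_ratio)) * (grid[1] - grid[0])
    return np.exp(log_normal_pdf(y, rho * x, var)) / integral

def bivariate_normal_pdf(x, y, rho):
    """Closed-form density of the standard bivariate normal with correlation rho."""
    var = 1.0 - rho ** 2
    q = (x * x - 2.0 * rho * x * y + y * y) / var
    return np.exp(-0.5 * q) / (2.0 * np.pi * np.sqrt(var))

print(joint_from_conditionals(0.3, -1.1, 0.6), bivariate_normal_pdf(0.3, -1.1, 0.6))
```

Taking the ratio on the log scale avoids dividing two very small numbers far in the tails, where both conditionals underflow.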



9.2 Fundamental Properties 

A particularly nice feature of the two-stage Gibbs sampler is that this algorithm lends itself to componentwise study, because the associated sequences $(X^{(t)})$ and $(Y^{(t)})$ are Markov chains. This decomposition into two Markov chains enables us to more thoroughly evaluate the properties of the two-stage Gibbs sampler.

9.2.1 Probabilistic Structures 

We have already seen in Section 9.1 that the individual subchains are both 
Markov chains. We next state their formal properties. 

A sufficient condition for irreducibility of the Gibbs Markov chain is the 
following condition, introduced by Besag (1974) (see also Section 10.1.3). We 
state it in full generality as it also applies to the general case of Chapter 10. 
Definition 9.4. Let $(Y_1, Y_2, \ldots, Y_p) \sim g(y_1, \ldots, y_p)$, where $g^{(i)}$ denotes the
marginal distribution of $Y_i$. If $g^{(i)}(y_i) > 0$ for every $i = 1, \ldots, p$ implies that
$g(y_1, \ldots, y_p) > 0$, then $g$ satisfies the positivity condition.

Thus, the support of $g$ is the Cartesian product of the supports of the
$g^{(i)}$'s. Moreover, it follows that the conditional distributions will not reduce
the range of possible values of $Y_i$ when compared with $g$. In this case, two
arbitrary Borel subsets of the support can be joined in a single iteration of
[A.33]. (Recall that strong irreducibility is introduced in Definition 6.13.)

Lemma 9.5. Each of the sequences $(X^{(t)})$ and $(Y^{(t)})$ produced by [A.33] is a
Markov chain with corresponding stationary distributions

\[
f_X(x) = \int f(x,y) \, dy \quad \text{and} \quad f_Y(y) = \int f(x,y) \, dx .
\]

If the positivity constraint on $f$ holds, then both chains are strongly irreducible.

Proof. The development of (9.2) shows that each chain is, individually, a
Markov chain. Under the positivity constraint, if $f$ is positive, $f_{X|Y}(x|y)$ is
positive on the (projected) support of $f$ and every Borel set of $\mathcal{X}$ can be
visited in a single iteration of [A.33], establishing the strong irreducibility.
This development also applies to $(Y^{(t)})$. □

This elementary reasoning shows, in addition, that if only the chain $(X^{(t)})$
is of interest and if the condition $f_{X|Y}(x|y) > 0$ holds for every pair $(x', y)$,
irreducibility is satisfied. As shown further in Section 9.2.3, the “dual” chain
$(Y^{(t)})$ can be used to establish some probabilistic properties of $(X^{(t)})$.

Convergence of the two-stage Gibbs sampler will follow as a special case of 
the general Gibbs sampler introduced in Chapter 10. However, for complete- 
ness we state the following convergence result here, whose proof follows the 
same lines as Lemma 7.3 and Theorem 7.4. 

Theorem 9.6. Under the positivity condition, if the transition kernel

\[
K((x,y),(x',y')) = f_{X|Y}(x'|y) \, f_{Y|X}(y'|x')
\]

is absolutely continuous with respect to the dominating measure, the chain
$(X^{(t)}, Y^{(t)})$ is Harris recurrent and ergodic with stationary distribution $f$.

Theorem 9.6 implies convergence for most two-stage Gibbs samplers since
the kernel $K$ will be absolutely continuous in most setups, and there
will be no point mass to worry about. In addition, the specific features of the
two-stage Gibbs sampler induce special convergence properties (interleaving,
duality) described in the following sections.

Typical illustrations of the two-stage Gibbs sampler are in missing variable 
models, where one chain usually is on a finite state space. 
Example 9.7. Grouped counting data. For 360 consecutive time units,
consider recording the number of passages of individuals, per unit time, past
some sensor. This can be, for instance, the number of cars observed at a
crossroad or the number of leucocytes in a region of a blood sample. Hypothetical
results are given in Table 9.1. This table, therefore, involves a grouping of the
observations with four passages and more.

    Number of passages         0     1     2     3    4 or more
    Number of observations   139   128    55    25       13

Table 9.1. Frequencies of passage for 360 consecutive observations.

If we assume that every observation is a Poisson $\mathcal{P}(\lambda)$ random variable,
the likelihood of the model corresponding to Table 9.1 is

\[
\ell(\lambda|x_1, \ldots, x_5) \propto e^{-347\lambda} \, \lambda^{128 + 55 \times 2 + 25 \times 3}
\left( 1 - e^{-\lambda} \sum_{i=0}^{3} \frac{\lambda^i}{i!} \right)^{13}
\]

for $x_1 = 139, \ldots, x_5 = 13$.

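For $\pi(\lambda) = 1/\lambda$ and $y = (y_1, \ldots, y_{13})$, the vector of the 13 observations
equal to 4 or more, it is possible to complete the posterior distribution
$\pi(\lambda|x_1, \ldots, x_5)$ in $\pi(\lambda, y_1, \ldots, y_{13}|x_1, \ldots, x_5)$ and to propose the Gibbs
sampling algorithm associated with the two components $\lambda$ and $y$.

Algorithm A.35: Poisson-Gamma Gibbs Sampler
Given $\lambda^{(t-1)}$,

1. Simulate $Y_i^{(t)} \sim \mathcal{P}(\lambda^{(t-1)}) \, \mathbb{I}_{y \ge 4}$ $(i = 1, \ldots, 13)$

2. Simulate $\lambda^{(t)} \sim \mathcal{G}a\left( 313 + \sum_{i=1}^{13} y_i^{(t)}, \; 360 \right)$.   [A.35]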



Figure 9.2 describes the convergence of the Rao-Blackwellized estimator
(see Section 4.2 for motivation and Sections 9.3 and 10.4.2 for justification)

\[
\delta_{rb} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ \lambda \,\middle|\, x_1, \ldots, x_5, y_1^{(t)}, \ldots, y_{13}^{(t)} \right]
= \frac{1}{360\,T} \sum_{t=1}^{T} \left( 313 + \sum_{i=1}^{13} y_i^{(t)} \right)
\]

to $\delta^{\pi} = 1.0224$, the Bayes estimator of $\lambda$, along with a histogram of the $\lambda^{(t)}$'s
simulated from [A.35]. The convergence to the Bayes estimator is particularly
fast in this case, a fact related to the feature that $\delta_{rb}$ only depends on the
finite state-space chain $(y^{(t)})$. ||

Fig. 9.2. Evolution of the estimator $\delta_{rb}$ against the number of iterations of [A.35]
and (insert) histogram of the sample of $\lambda^{(t)}$'s for 500 iterations.
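A minimal sketch of the sampler [A.35] for the data of Table 9.1 follows. It is an illustration rather than the authors' code: the function names, the starting value `lam0`, and the cap `upper=60` on the truncated Poisson support (whose neglected mass is negligible for $\lambda$ near 1) are assumptions made here. The Gamma distribution is parameterized by shape and rate 360, hence `scale=1/360` in NumPy.

```python
import math
import numpy as np

def truncated_poisson(rng, lam, size, lower=4, upper=60):
    """Draw from P(lam) restricted to {lower, ..., upper}.

    The upper cap is an approximation: for lam near 1 the mass above 60 is negligible.
    """
    ks = np.arange(lower, upper + 1)
    # Unnormalized log-probabilities k*log(lam) - log(k!)
    log_w = ks * math.log(lam) - np.array([math.lgamma(k + 1) for k in ks])
    w = np.exp(log_w - log_w.max())
    return rng.choice(ks, size=size, p=w / w.sum())

def poisson_gamma_gibbs(n_iter, rng, lam0=1.0):
    """Two-stage Gibbs sampler [A.35]; returns the lambda draws and delta_rb."""
    lam = lam0
    lams = np.empty(n_iter)
    rb_terms = np.empty(n_iter)
    for t in range(n_iter):
        # Step 1: complete the 13 grouped observations (each is >= 4)
        y = truncated_poisson(rng, lam, size=13)
        # Step 2: lambda | y ~ Ga(313 + sum(y), 360), with 360 a rate parameter
        lam = rng.gamma(shape=313 + y.sum(), scale=1.0 / 360.0)
        lams[t] = lam
        # Rao-Blackwellized term E[lambda | x, y] = (313 + sum(y)) / 360
        rb_terms[t] = (313 + y.sum()) / 360.0
    return lams, rb_terms.mean()

rng = np.random.default_rng(0)
lams, delta_rb = poisson_gamma_gibbs(2000, rng)
```

The returned `delta_rb` averages the conditional expectations rather than the simulated $\lambda^{(t)}$'s, which is why it depends only on the finite state-space chain of the completed counts.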



Example 9.8. Grouped multinomial data. Tanner and Wong (1987) consider
the multinomial model

\[
X \sim \mathcal{M}_5\left( n; \, a_1\mu + b_1, \, a_2\mu + b_2, \, a_3\eta + b_3, \, a_4\eta + b_4, \, c(1 - \mu - \eta) \right) ,
\]

with

\[
0 \le a_1 + a_2 = a_3 + a_4 = 1 - \sum_{i=1}^{4} b_i = c \le 1 ,
\]

where the $a_i, b_i \ge 0$ are known, based on genetic considerations. This model
is equivalent to a sampling from

\[
Y \sim \mathcal{M}_9\left( n; \, a_1\mu, b_1, a_2\mu, b_2, a_3\eta, b_3, a_4\eta, b_4, c(1 - \mu - \eta) \right) ,
\]

where some observations are aggregated (and thus missing),

\[
X_1 = Y_1 + Y_2, \quad X_2 = Y_3 + Y_4, \quad X_3 = Y_5 + Y_6, \quad X_4 = Y_7 + Y_8, \quad X_5 = Y_9 .
\]

A natural prior distribution on $(\mu, \eta)$ is the Dirichlet prior $\mathcal{D}(\alpha_1, \alpha_2, \alpha_3)$,

\[
\pi(\mu, \eta) \propto \mu^{\alpha_1 - 1} \eta^{\alpha_2 - 1} (1 - \eta - \mu)^{\alpha_3 - 1} ,
\]

where $\alpha_1 = \alpha_2 = \alpha_3 = 1/2$ corresponds to the noninformative case. The
posterior distributions $\pi(\mu|x)$ and $\pi(\eta|x)$ are not readily available, as shown
in Problem 9.10. If we define $Z = (Z_1, Z_2, Z_3, Z_4) = (Y_1, Y_3, Y_5, Y_7)$, the
completed posterior distribution can be defined as

\[
\pi(\eta, \mu|x, z) = \pi(\eta, \mu|y)
\propto \mu^{z_1 + z_2 + \alpha_1 - 1} \, \eta^{z_3 + z_4 + \alpha_2 - 1} \, (1 - \eta - \mu)^{x_5 + \alpha_3 - 1} .
\]

Thus,

\[
(\mu, \eta, 1 - \mu - \eta) \,|\, x, z \sim \mathcal{D}(z_1 + z_2 + \alpha_1, \; z_3 + z_4 + \alpha_2, \; x_5 + \alpha_3) .
\]

Moreover,

\[
Z_i \,|\, x, \mu, \eta \sim \mathcal{B}\left( x_i, \frac{a_i \mu}{a_i \mu + b_i} \right) \quad (i = 1, 2),
\qquad
Z_i \,|\, x, \mu, \eta \sim \mathcal{B}\left( x_i, \frac{a_i \eta}{a_i \eta + b_i} \right) \quad (i = 3, 4).
\]

Therefore, if $\theta = (\mu, \eta)$, this completion provides manageable conditional
distributions $g_1(\theta|y)$ and $g_2(z|x, \theta)$. ||
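The two conditional distributions of Example 9.8 translate directly into a two-stage sampler. The sketch below is illustrative only: the counts `x` and the constants `a`, `b` are hypothetical values chosen to satisfy the constraint $a_1 + a_2 = a_3 + a_4 = 1 - \sum b_i$ (they are not the Tanner and Wong data), and the function name and starting values are assumptions.

```python
import numpy as np

def grouped_multinomial_gibbs(x, a, b, alpha, n_iter, rng):
    """Two-stage Gibbs sampler alternating between z | x, theta and theta | x, z."""
    x = np.asarray(x, dtype=int)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mu, eta = 1.0 / 3.0, 1.0 / 3.0          # arbitrary starting point inside the simplex
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # z_i | x, mu, eta ~ Binomial(x_i, a_i*theta_i / (a_i*theta_i + b_i)),
        # with theta_i = mu for i = 1, 2 and theta_i = eta for i = 3, 4
        theta = np.array([mu, mu, eta, eta])
        z = rng.binomial(x[:4], a * theta / (a * theta + b))
        # (mu, eta, 1 - mu - eta) | x, z ~ Dirichlet
        mu, eta, _ = rng.dirichlet([z[0] + z[1] + alpha[0],
                                    z[2] + z[3] + alpha[1],
                                    x[4] + alpha[2]])
        draws[t] = mu, eta
    return draws

# Hypothetical inputs (not from the source): a1 + a2 = a3 + a4 = 0.4 = 1 - sum(b)
a = [0.1, 0.3, 0.2, 0.2]
b = [0.15, 0.15, 0.15, 0.15]
x = [18, 24, 25, 25, 8]
rng = np.random.default_rng(1)
draws = grouped_multinomial_gibbs(x, a, b, alpha=[0.5, 0.5, 0.5], n_iter=1000, rng=rng)
```

Averaging the columns of `draws` (after discarding some initial iterations) gives approximate posterior means of $\mu$ and $\eta$ for these hypothetical data.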



Example 9.9. Capture-recapture uniform model. In the setup of 
capture-recapture models seen in Examples 2.25 and 5.22, the simplest case 
corresponds to the setting where the size N of the entire population is un- 
known and each individual has a probability p of being captured in every 
capture experiment, whatever its past history. For two successive captures, 
a sufficient statistic is the triplet $(n_{11}, n_{10}, n_{01})$, where $n_{11}$ is the number of
individuals which have been captured twice, $n_{10}$ the number of individuals
captured only in the first experiment, and $n_{01}$ the number of individuals
captured only in the second experiment. Writing $n = n_{10} + n_{01} + 2 n_{11}$, the total
number of captures, the likelihood of this model is given by

\[
\ell(N, p|n_{11}, n_{10}, n_{01}) \propto \binom{N}{n_{11} \;\; n_{10} \;\; n_{01}} \, p^{n} (1 - p)^{2N - n} .
\]

Castledine (1981) calls this a uniform likelihood (see also Wolter 1986 or 
George and Robert 1992). 

It is easy to see that the likelihood function factors through $n$ and
$n' = n_{10} + n_{11} + n_{01}$, the total number of different individuals captured in the
experiments. If $\pi(p, N)$ corresponds to a Poisson distribution $\mathcal{P}(\lambda)$ on $N$ and
a uniform distribution $\mathcal{U}_{[0,1]}$ on $p$,

\[
\pi(p, N|n, n') \propto \frac{N!}{(N - n')!} \, p^{n} (1 - p)^{2N - n} \, \frac{e^{-\lambda} \lambda^{N}}{N!}
\]

implies that

\[
(N - n') \,|\, p, n, n' \sim \mathcal{P}\left( \lambda (1 - p)^2 \right) ,
\qquad
p \,|\, N, n, n' \sim \mathcal{B}e(n + 1, \; 2N - n + 1) ,
\]

and therefore that the two-stage Gibbs sampling is available in this setup
(even though direct computation is possible; see Problems 9.12 and 9.13). ||
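As a sketch of the sampler of Example 9.9, the code below alternates between the Poisson conditional of $N - n'$ and the Beta conditional of $p$. The capture counts and the prior mean `lam` are hypothetical values introduced only for illustration, and the function name is an assumption.

```python
import numpy as np

def capture_recapture_gibbs(n11, n10, n01, lam, n_iter, rng):
    """Two-stage Gibbs sampler for (N, p) in the uniform capture-recapture model."""
    n = n10 + n01 + 2 * n11          # total number of captures
    n_prime = n10 + n01 + n11        # number of distinct individuals captured
    p = 0.5                          # arbitrary starting value
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        # N - n' | p ~ Poisson(lam * (1 - p)^2)
        N = n_prime + rng.poisson(lam * (1.0 - p) ** 2)
        # p | N ~ Beta(n + 1, 2N - n + 1)
        p = rng.beta(n + 1, 2 * N - n + 1)
        out[t] = N, p
    return out

# Hypothetical data (not from the source)
rng = np.random.default_rng(2)
out = capture_recapture_gibbs(n11=20, n10=30, n01=25, lam=100.0, n_iter=2000, rng=rng)
```

Since $N \ge n'$ at every iteration, the second Beta parameter $2N - n + 1 \ge n_{10} + n_{01} + 1$ stays positive, so both conditionals are always well defined.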



9.2.2 Reversible and Interleaving Chains 

The two-stage Gibbs sampler was shown by Liu et al. (1994) to have a very
strong structural property. The two (sub)chains, called, say, the chain of
interest, $(X^{(t)})$, and the instrumental (or dual) chain, $(Y^{(t)})$, satisfy a duality
property they call interleaving. This property is mostly characteristic of
two-stage Gibbs samplers.

Definition 9.10. Two Markov chains $(X^{(t)})$ and $(Y^{(t)})$ are said to be
conjugate to each other with the interleaving property (or interleaved) if

(i) $X^{(t)}$ and $X^{(t+1)}$ are independent conditionally on $Y^{(t)}$;

(ii) $Y^{(t-1)}$ and $Y^{(t)}$ are independent conditionally on $X^{(t)}$; and

(iii) $(X^{(t)}, Y^{(t-1)})$ and $(X^{(t)}, Y^{(t)})$ are identically distributed under
stationarity.

In most cases where we have interleaved chains, there is more interest in
one of the chains. That the property of interleaving is always satisfied by
two-stage Gibbs samplers is immediate, as shown below. Note that the (global)
chain $(X^{(t)}, Y^{(t)})$ is not necessarily (time-) reversible.

Lemma 9.11. Each of the chains $(X^{(t)})$ and $(Y^{(t)})$ generated by a two-stage
Gibbs sampling algorithm is reversible, and the chain $(X^{(t)}, Y^{(t)})$ satisfies the
interleaving property.

Proof. We first establish reversibility for each chain, a property that is
independent of interleaving: it follows from the detailed balance condition
(Theorem 6.46). Consider $(X_0, Y_0)$ distributed from the stationary distribution $g$,
with respective marginal distributions $g^{(1)}(x)$ and $g^{(2)}(y)$. Then, if

\[
K_1(x_0, x_1) = \int g_2(y_0|x_0) \, g_1(x_1|y_0) \, dy_0
\]

denotes the transition kernel of $(X^{(t)})$,

\begin{align*}
g^{(1)}(x_0) \, K_1(x_0, x_1) &= g^{(1)}(x_0) \int g_2(y_0|x_0) \, g_1(x_1|y_0) \, dy_0 \\
&= \int g(x_0, y_0) \, g_1(x_1|y_0) \, dy_0 \\
&= \int g(x_0, y_0) \, \frac{g_2(y_0|x_1) \, g^{(1)}(x_1)}{g^{(2)}(y_0)} \, dy_0 \\
&= g^{(1)}(x_1) \int g_2(y_0|x_1) \, g_1(x_0|y_0) \, dy_0 \\
&= g^{(1)}(x_1) \, K_1(x_1, x_0),
\end{align*}

where the fourth equality uses $g(x_0, y_0)/g^{(2)}(y_0) = g_1(x_0|y_0)$ and the last
one is the definition of $K_1$. Thus, the reversibility of the
$(X^{(t)})$ chain is established, and a similar argument applies for the reversibility
of the $(Y^{(t)})$ chain; that is, if $K_2$ denotes the associated kernel, the detailed
balance condition

\[
g^{(2)}(y_0) \, K_2(y_0, y_1) = g^{(2)}(y_1) \, K_2(y_1, y_0)
\]

is satisfied.

Turning to the interleaving property of Definition 9.10, the construction of
each chain establishes (i) and (ii) directly. To see that property (iii) is satisfied,
we note that the joint cdf of $X_0$ and $Y_0$ is

\[
P(X_0 \le x, Y_0 \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} g(v, u) \, du \, dv
\]

and the joint cdf of $X_1$ and $Y_0$ is

\[
P(X_1 \le x, Y_0 \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} g_1(v|u) \, g^{(2)}(u) \, du \, dv .
\]

Since

\begin{align*}
g_1(v|u) \, g^{(2)}(u) &= \int g_1(v|u) \, g(v', u) \, dv' \\
&= \int g_1(v|u) \, g_1(v'|u) \, g^{(2)}(u) \, dv' \\
&= g(v, u),
\end{align*}

the result follows, and $(X^{(t)})$ and $(Y^{(t)})$ are interleaved chains. □
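The detailed balance conditions of Lemma 9.11 can be verified exactly on a finite state space, where the integrals become matrix products. The sketch below assumes a hypothetical joint pmf `g` stored as a matrix (rows index $x$, columns index $y$); the function name is illustrative.

```python
import numpy as np

def two_stage_kernels(g):
    """Marginals and subchain kernels of a two-stage Gibbs sampler.

    g[i, j] is a joint pmf of (X = i, Y = j) with all entries positive.
    """
    gx = g.sum(axis=1)               # marginal of X
    gy = g.sum(axis=0)               # marginal of Y
    g2 = g / gx[:, None]             # g2[i, j] = P(Y = j | X = i)
    g1 = g / gy[None, :]             # g1[i, j] = P(X = i | Y = j)
    K1 = g2 @ g1.T                   # K1[x0, x1] = sum_y g2(y|x0) g1(x1|y)
    K2 = g1.T @ g2                   # K2[y0, y1] = sum_x g1(x|y0) g2(y1|x)
    return gx, gy, K1, K2

rng = np.random.default_rng(3)
g = rng.random((4, 3))
g /= g.sum()                          # hypothetical positive joint pmf
gx, gy, K1, K2 = two_stage_kernels(g)

# Detailed balance: the stationary "flow" matrices are symmetric
flow_x = gx[:, None] * K1
flow_y = gy[:, None] * K2
x_reversible = np.allclose(flow_x, flow_x.T)
y_reversible = np.allclose(flow_y, flow_y.T)
```

Symmetry of `flow_x` and `flow_y` is exactly the detailed balance condition for the $(X^{(t)})$ and $(Y^{(t)})$ subchains.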

To ensure that the entire chain $(X^{(t)}, Y^{(t)})$ is reversible, an additional step
is necessary in [A.33].

Algorithm A.36: Reversible Two-Stage Gibbs
Given $y_2^{(t)}$,

1. Simulate $W \sim g_1(w|y_2^{(t)})$;

2. Simulate $Y_2^{(t+1)} \sim g_2(y_2|w)$;   [A.36]

3. Simulate $Y_1^{(t+1)} \sim g_1(y_1|y_2^{(t+1)})$.

In this case, writing $(Y_1, Y_2)$ for the current pair and $(Y_1', Y_2')$ for the next one,

\begin{align*}
(Y_1, Y_2, Y_1', Y_2') &\sim g(y_1, y_2) \int g_1(w|y_2) \, g_2(y_2'|w) \, dw \; g_1(y_1'|y_2') \\
&= g(y_1', y_2') \int \frac{g_2(y_2'|w)}{g^{(2)}(y_2')} \, g(w, y_2) \, dw \; g_1(y_1|y_2) \\
&= g(y_1', y_2') \int g_1(w|y_2') \, g_2(y_2|w) \, dw \; g_1(y_1|y_2)
\end{align*}

is distributed as $(Y_1', Y_2', Y_1, Y_2)$.

9.2.3 The Duality Principle

The well-known concept of Rao-Blackwellization (Sections 4.2, 7.6.2, and to
come in Section 9.3) is based on the fact that a conditioning argument, using
variables other than those directly of interest, can result in an improved
procedure. In the case of interleaving Markov chains, this phenomenon goes deeper.
We will see in this section that this phenomenon is more fundamental than a
mere improvement of the variance of some estimators, as it provides a general
technique to establish convergence properties for the chain of interest $(X^{(t)})$
based on the instrumental chain $(Y^{(t)})$, even when the latter is unrelated to
the inferential problem. Diebolt and Robert (1993, 1994) have called this use
of the dual chain the Duality Principle when they used it in the setup of
mixtures of distributions. While it has been introduced in a two-stage Gibbs
sampling setup, this principle extends to other Gibbs sampling methods since
it is sufficient that the chain of interest $(X^{(t)})$ be generated conditionally on
another chain $(Y^{(t)})$, which supplies the probabilistic properties.

Theorem 9.12. Consider a Markov chain $(Y^{(t)})$ and a sequence $(X^{(t)})$ of
random variables generated from the conditional distributions

\[
X^{(t)} \,|\, y^{(t)} \sim \pi(x|y^{(t)}) , \qquad Y^{(t+1)} \,|\, x^{(t)}, y^{(t)} \sim f(y|x^{(t)}, y^{(t)}) .
\]

If the chain $(Y^{(t)})$ is ergodic (geometrically or uniformly ergodic) and if $\pi^t_{y^{(0)}}$
denotes the distribution of $(X^{(t)})$ associated with the initial value $y^{(0)}$, the
norm $\| \pi^t_{y^{(0)}} - \pi \|_{TV}$ goes to 0 when $t$ goes to infinity (goes to 0 at a geometric
or uniformly bounded rate).
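Proof. The transition kernel is

\[
K(y, y') = \int_{\mathcal{X}} \pi(x|y) \, f(y'|x, y) \, dx .
\]

If $f^t_{y^{(0)}}(y)$ denotes the marginal density of $Y^{(t)}$, then the marginal of $X^{(t)}$ is

\[
\pi^t_{y^{(0)}}(x) = \int_{\mathcal{Y}} \pi(x|y) \, f^t_{y^{(0)}}(y) \, dy
\]

and convergence in the $(X^{(t)})$ chain can be tied to convergence in the $(Y^{(t)})$
chain by the following. The total variation in the $(X^{(t)})$ chain is

\begin{align*}
\| \pi^t_{y^{(0)}} - \pi \|_{TV} &= \frac{1}{2} \int_{\mathcal{X}} \left| \pi^t_{y^{(0)}}(x) - \pi(x) \right| dx \\
&= \frac{1}{2} \int_{\mathcal{X}} \left| \int_{\mathcal{Y}} \pi(x|y) \left( f^t_{y^{(0)}}(y) - f(y) \right) dy \right| dx \\
&\le \frac{1}{2} \int_{\mathcal{X} \times \mathcal{Y}} \left| f^t_{y^{(0)}}(y) - f(y) \right| \pi(x|y) \, dx \, dy \\
&= \| f^t_{y^{(0)}} - f \|_{TV} ,
\end{align*}

so the convergence properties of $\| f^t_{y^{(0)}} - f \|_{TV}$ can be transferred to
$\| \pi^t_{y^{(0)}} - \pi \|_{TV}$. Both sequences have the same speed of convergence since

\[
\| f^{t+1}_{y^{(0)}} - f \|_{TV} \le \| \pi^t_{y^{(0)}} - \pi \|_{TV} \le \| f^t_{y^{(0)}} - f \|_{TV} .
\]

□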

Note that this setting contains as a particular case hidden Markov models,
where $Y^{(t+1)} \,|\, x^{(t)}, y^{(t)} \sim f(y|y^{(t)})$ is not observed. These models will be
detailed in Section 14.3.2.

The duality principle is even easier to state, and stronger, when $(Y^{(t)})$ is
a finite state-space Markov chain.

Theorem 9.13. If $(Y^{(t)})$ is a finite state-space Markov chain, with state-space
$\mathcal{Y}$, such that

\[
P(Y^{(t+1)} = k \,|\, y^{(t)}, x) > 0, \qquad \forall k \in \mathcal{Y}, \; \forall x \in \mathcal{X},
\]

the sequence $(X^{(t)})$ derived from $(Y^{(t)})$ by the transition $\pi(x|y^{(t)})$ is uniformly
ergodic.

Notice that this convergence result does not impose any constraint on the
transition $\pi(x|y)$, which, for instance, is not necessarily everywhere positive.
Proof. The result follows from Theorem 9.12. First, write

\[
p_{ij} = \int P(Y^{(t+1)} = j \,|\, x, y^{(t)} = i) \, \pi(x|y^{(t)} = i) \, dx, \qquad i, j \in \mathcal{Y}.
\]

If we define the lower bound $\rho = \min_{i,j \in \mathcal{Y}} p_{ij} > 0$, then

\[
P(Y^{(t)} = i \,|\, y^{(0)} = j) \ge \rho \sum_{k \in \mathcal{Y}} P(Y^{(t-1)} = k \,|\, y^{(0)} = j) = \rho ,
\]

and hence

\[
\sum_{t=1}^{\infty} P(Y^{(t)} = i \,|\, y^{(0)} = j) = \infty
\]

for every state $i$ of $\mathcal{Y}$. Thus, the chain is positive recurrent with limiting
distribution

\[
q_i = \lim_{t \to \infty} P(Y^{(t)} = i \,|\, y^{(0)} = j).
\]

Uniform ergodicity follows from the finiteness of $\mathcal{Y}$ (see, e.g., Billingsley 1968).

□
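The total variation comparison in the proof of Theorem 9.12 can be illustrated on finite state spaces. The sketch below rests on simplifying assumptions that are not in the text: both chains are discrete, the transition matrices are randomly generated, and the transition of $Y^{(t+1)}$ depends on $x^{(t)}$ only (a special case of $f(y|x, y)$). All names are hypothetical.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two pmfs."""
    return 0.5 * np.abs(p - q).sum()

def duality_tv_paths(pi_x_given_y, f_y_given_x, y0, n_steps):
    """TV distances to stationarity for the X and Y marginals, started at y0.

    pi_x_given_y[y, x] = pi(x | y); f_y_given_x[x, y'] = f(y' | x).
    """
    K = pi_x_given_y @ f_y_given_x                 # K[y, y'] = sum_x pi(x|y) f(y'|x)
    f_stat = np.linalg.matrix_power(K, 500)[0]     # stationary pmf of Y (power iteration)
    pi_stat = f_stat @ pi_x_given_y                # induced limiting pmf of X
    f_t = np.zeros(K.shape[0])
    f_t[y0] = 1.0                                  # point mass at the initial value
    tv_x, tv_y = [], []
    for _ in range(n_steps):
        tv_y.append(tv(f_t, f_stat))
        tv_x.append(tv(f_t @ pi_x_given_y, pi_stat))
        f_t = f_t @ K
    return np.array(tv_x), np.array(tv_y)

rng = np.random.default_rng(4)
pi_x_given_y = rng.dirichlet(np.ones(6), size=4)   # 4 states for Y, 6 for X
f_y_given_x = rng.dirichlet(np.ones(4), size=6)
tv_x, tv_y = duality_tv_paths(pi_x_given_y, f_y_given_x, y0=0, n_steps=15)
```

In this setting `tv_x[t] <= tv_y[t]` at every step, and since all transition probabilities are positive, both sequences decrease geometrically, in line with Theorem 9.13.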

The statement of Theorems 9.12 and 9.13 is complicated by the fact that
$(X^{(t)})$ is not necessarily a Markov chain and, therefore, the notions of ergodicity
and of geometric ergodicity do not apply to this sequence. However, there
still exists a limiting distribution $\pi$ for $(X^{(t)})$, obtained by transformation of
the limiting distribution of $(Y^{(t)})$ by the transition $\pi(x|y)$. The particular case
of two-stage Gibbs sampling allows for a less complicated formulation.

Corollary 9.14. For two interleaved Markov chains, $(X^{(t)})$ and $(Y^{(t)})$, if
$(X^{(t)})$ is ergodic (geometrically ergodic), then $(Y^{(t)})$ is ergodic (geometrically
ergodic).

Example 9.15. (Continuation of Example 9.8) The vector $(z_1, \ldots, z_4)$
takes its values in a finite space of size $(x_1 + 1) \times (x_2 + 1) \times (x_3 + 1) \times (x_4 + 1)$
and the transition is strictly positive. Corollary 9.14 therefore implies that the
chain $(\eta^{(t)}, \mu^{(t)})$ is uniformly ergodic. ||



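Finally, we turn to the question of rate of convergence, and find that the
duality principle still applies, and the rates transfer between chains.

Proposition 9.16. If $(Y^{(t)})$ is geometrically convergent with compact
state-space and with convergence rate $\rho$, there exists $C_h$ such that

\[
\left\| \mathbb{E}\left[ h(X^{(t)}) \,\middle|\, y^{(0)} \right] - \mathbb{E}^{\pi}[h(X)] \right\| < C_h \, \rho^t , \tag{9.8}
\]

for every function $h \in \mathcal{L}_1(\pi(\cdot|x))$ uniformly in $y^{(0)} \in \mathcal{Y}$.

Proof. For $h(x) = (h_1(x), \ldots, h_d(x))$,

\begin{align*}
\left\| \mathbb{E}\left[ h(X^{(t)}) \,\middle|\, y^{(0)} \right] - \mathbb{E}^{\pi}[h(X)] \right\|^2
&= \sum_{i=1}^{d} \left( \mathbb{E}\left[ h_i(X^{(t)}) \,\middle|\, y^{(0)} \right] - \mathbb{E}^{\pi}[h_i(X)] \right)^2 \\
&= \sum_{i=1}^{d} \left( \int h_i(x) \left\{ \pi^t(x|y^{(0)}) - \pi(x) \right\} dx \right)^2 \\
&= \sum_{i=1}^{d} \left( \int \!\! \int h_i(x) \, \pi(x|y) \left\{ f^t(y|y^{(0)}) - f(y) \right\} dy \, dx \right)^2 \\
&\le \sum_{i=1}^{d} \sup_{y \in \mathcal{Y}} \mathbb{E}[|h_i(X)| \,|\, y]^2 \; 4 \, \| f^t_{y^{(0)}} - f \|^2_{TV} \\
&\le 4d \, \max_{i} \sup_{y \in \mathcal{Y}} \mathbb{E}[|h_i(X)| \,|\, y]^2 \; \| f^t_{y^{(0)}} - f \|^2_{TV} .
\end{align*}

□

Unfortunately, this result has rather limited consequences for the study of
the sequence $(X^{(t)})$ since (9.8) is an average property of $(X^{(t)})$, while MCMC
algorithms such as [A.33] only produce a single realization from $\pi^t_{y^{(0)}}$. We
will see in Section 12.2.3 a more practical implication of the Duality Principle
since Theorem 9.12 may allow a control of convergence by renewal.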



9.3 Monotone Covariance and Rao-Blackwellization 

In the spirit of the Duality Principle, Rao-Blackwellization exhibits an in- 
teresting difference between statistical perspectives and simulation practice, 
in the sense that the approximations used in the estimator do not (directly) 
involve the chain of interest. As shown in Section 4.2 and Section 7.6.2, con- 
ditioning on a subset of the simulated variables may produce considerable 
improvement upon the standard empirical estimator in terms of variance, by 
a simple “recycling” of the rejected variables (see also Section 3.3.3). Two- 
stage Gibbs sampling and its generalization of Chapter 10 do not permit this 
kind of recycling since every simulated value is accepted (Theorem 10.13). 
Nonetheless, Gelfand and Smith (1990) propose a type of conditioning chris- 
tened Rao-Blackwellization in connection with the Rao-Blackwell Theorem 
(see Lehmann and Casella 1998, Section 1.7) and defined as parametric Rao- 
Blackwellization by Casella and Robert (1996) to differentiate from the form 
studied in Sections 4.2 and 7.6.2. 

For $Y = (Y_1, Y_2) \sim g(y_1, y_2)$, Rao-Blackwellization is based on the
marginalization identity

\[
g^{(1)}(y_1) = \int g_1(y_1|y_2) \, g^{(2)}(y_2) \, dy_2 .
\]

It thus replaces

\[
\delta_0 = \frac{1}{T} \sum_{t=1}^{T} h\left( y_1^{(t)} \right)
\]

with

\[
\delta_{rb} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ h(Y_1) \,\middle|\, y_2^{(t)} \right] .
\]

Both estimators converge to $\mathbb{E}[h(Y_1)]$ and, under the stationary distribution,
they are both unbiased. A simple application of the identity $\mathrm{var}(U) =
\mathrm{var}(\mathbb{E}[U|V]) + \mathbb{E}[\mathrm{var}(U|V)]$ implies that

\[
\mathrm{var}\left( \mathbb{E}[h(Y_1)|Y_2] \right) \le \mathrm{var}(h(Y_1)). \tag{9.9}
\]

This led Gelfand and Smith (1990) to suggest the use of $\delta_{rb}$ instead of $\delta_0$.
However, inequality (9.9) is insufficient to conclude on the domination of $\delta_{rb}$
when compared with $\delta_0$, as it fails to take into account the correlation between
the $Y^{(t)}$'s. The domination of $\delta_0$ by $\delta_{rb}$ can therefore be established in only a
few cases; Liu et al. (1994) show in particular that it holds for the two-stage
Gibbs sampler. (See also Geyer 1995 for necessary conditions.)

We establish the domination result in Theorem 9.19, but we first need 
some preliminary results, beginning with a representation lemma yielding the 
interesting result that covariances are positive in an interleaved chain. 

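Lemma 9.17. If $h \in \mathcal{L}_2(g_2)$ and if $(X^{(t)})$ is interleaved with $(Y^{(t)})$, then

\[
\mathrm{cov}\left( h(Y^{(1)}), h(Y^{(2)}) \right) = \mathrm{var}\left( \mathbb{E}[h(Y)|X] \right).
\]

Proof. Assuming, without loss of generality, that $\mathbb{E}_{g_2}[h(Y)] = 0$,

\begin{align*}
\mathrm{cov}\left( h(Y^{(1)}), h(Y^{(2)}) \right) &= \mathbb{E}\left[ h(Y^{(1)}) \, h(Y^{(2)}) \right] \\
&= \mathbb{E}\left\{ \mathbb{E}\left[ h(Y^{(1)}) \,\middle|\, X^{(2)} \right] \mathbb{E}\left[ h(Y^{(2)}) \,\middle|\, X^{(2)} \right] \right\} \\
&= \mathbb{E}\left\{ \mathbb{E}[h(Y)|X]^2 \right\} = \mathrm{var}\left( \mathbb{E}[h(Y)|X] \right) ,
\end{align*}

where the second equality follows from iterating the expectation and using
the conditional independence of the interleaving property. The last equality
uses reversibility (that is, condition (iii)) of the interleaved chains. □

Proposition 9.18. If $(Y^{(t)})$ is a Markov chain with the interleaving property,
the covariances

\[
\mathrm{cov}\left( h(Y^{(1)}), h(Y^{(t)}) \right)
\]

are positive and decreasing in $t$ for every $h \in \mathcal{L}_2(g_2)$.

Proof. Lemma 9.17 implies, by induction, that

\begin{align*}
\mathrm{cov}\left( h(Y^{(1)}), h(Y^{(t)}) \right) &= \mathbb{E}\left[ \mathbb{E}[h(Y)|X^{(2)}] \, \mathbb{E}[h(Y)|X^{(t)}] \right] \\
&= \mathrm{var}\left( \mathbb{E}[\cdots \mathbb{E}[\mathbb{E}[h(Y)|X]|Y] \cdots] \right) , \tag{9.10}
\end{align*}

where the last term involves $(t - 1)$ conditional expectations, alternately in $Y$
and in $X$. The decrease in $t$ directly follows from the inequality on conditional
expectations, by virtue of the representation (9.10) and the inequality (9.9).

□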
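The positivity and monotone decrease stated in Proposition 9.18 can be computed exactly for a discrete two-stage Gibbs sampler. The sketch below assumes a hypothetical positive joint pmf `g` (rows index $x$, columns index $y$) and an arbitrary function `h` of $y$; the function name is illustrative.

```python
import numpy as np

def lagged_covariances(g, h, max_lag):
    """Exact stationary covariances cov(h(Y^(t)), h(Y^(t+k))) for k = 1..max_lag."""
    gx = g.sum(axis=1)
    gy = g.sum(axis=0)
    g1 = g / gy[None, :]             # g1[i, j] = P(X = i | Y = j)
    g2 = g / gx[:, None]             # g2[i, j] = P(Y = j | X = i)
    K2 = g1.T @ g2                   # kernel of the Y subchain
    hc = h - gy @ h                  # center h under the stationary marginal of Y
    covs = []
    v = hc.copy()
    for _ in range(max_lag):
        v = K2 @ v                   # v[y] = E[hc(Y^(t+k)) | Y^(t) = y]
        covs.append(np.sum(gy * hc * v))
    return np.array(covs)

rng = np.random.default_rng(5)
g = rng.random((5, 4))
g /= g.sum()                          # hypothetical joint pmf
h = rng.standard_normal(4)            # arbitrary function of y
covs = lagged_covariances(g, h, max_lag=8)
```

The entries of `covs` are nonnegative and nonincreasing in the lag, which is the discrete counterpart of the representation (9.10).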

The result on the improvement brought by Rao-Blackwellization then eas- 
ily follows from Proposition 9.18. 

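Theorem 9.19. If $(X^{(t)})$ and $(Y^{(t)})$ are two interleaved Markov chains, with
stationary distributions $f_X$ and $f_Y$, respectively, the estimator $\delta_{rb}$ dominates
the estimator $\delta_0$ for every function $h$ with finite variance under both $f_X$ and
$f_Y$.

Proof. Again assuming $\mathbb{E}[h(X)] = 0$, and introducing the estimators

\[
\delta_0 = \frac{1}{T} \sum_{t=1}^{T} h(X^{(t)}), \qquad
\delta_{rb} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ h(X) \,\middle|\, Y^{(t)} \right], \tag{9.11}
\]

it follows that

\begin{align*}
\mathrm{var}(\delta_0) &= \frac{1}{T^2} \sum_{t,t'} \mathrm{cov}\left( h(X^{(t)}), h(X^{(t')}) \right) \\
&= \frac{1}{T^2} \sum_{t,t'} \mathrm{var}\left( \mathbb{E}[\cdots \mathbb{E}[h(X)|Y] \cdots] \right) \tag{9.12}
\end{align*}

and

\begin{align*}
\mathrm{var}(\delta_{rb}) &= \frac{1}{T^2} \sum_{t,t'} \mathrm{cov}\left( \mathbb{E}[h(X)|Y^{(t)}], \mathbb{E}[h(X)|Y^{(t')}] \right) \\
&= \frac{1}{T^2} \sum_{t,t'} \mathrm{var}\left( \mathbb{E}[\cdots \mathbb{E}[\mathbb{E}[h(X)|Y]|X] \cdots] \right) , \tag{9.13}
\end{align*}

according to the proof of Proposition 9.18, with $|t - t'|$ conditional expectations
in the general term of (9.12) and $|t - t'| + 1$ in the general term of (9.13). It
is then sufficient to compare $\mathrm{var}(\delta_0)$ with $\mathrm{var}(\delta_{rb})$ term by term to conclude
that $\mathrm{var}(\delta_0) \ge \mathrm{var}(\delta_{rb})$. □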

One might question whether Rao-Blackwellization will always result in 
an appreciable variance reduction, even as the sample size (or the number of 
Monte Carlo iterations) increases. This point was addressed by Levine (1996), 
who formulated this problem in terms of the asymptotic relative efficiency
(ARE) of $\delta_0$ with respect to its Rao-Blackwellized version $\delta_{rb}$, given in (9.11),
where the pairs $(X^{(t)}, Y^{(t)})$ are generated from a bivariate Gibbs sampler.
The ARE is a ratio of the variances of the limiting distributions for the two
estimators, which are given by

\[
\sigma^2_{\delta_0} = \mathrm{var}(h(X^{(0)})) + 2 \sum_{k=1}^{\infty} \mathrm{cov}\left( h(X^{(0)}), h(X^{(k)}) \right) \tag{9.14}
\]

and

\[
\sigma^2_{\delta_{rb}} = \mathrm{var}(\mathbb{E}[h(X)|Y])
+ 2 \sum_{k=1}^{\infty} \mathrm{cov}\left( \mathbb{E}[h(X^{(0)})|Y^{(0)}], \mathbb{E}[h(X^{(k)})|Y^{(k)}] \right). \tag{9.15}
\]

Levine (1996) established that the ratio $\sigma^2_{\delta_0} / \sigma^2_{\delta_{rb}} \ge 1$, with equality if and only
if $\mathrm{var}(h(X)) = \mathrm{cov}(\mathbb{E}[h(X)|Y]) = 0$.

Example 9.20. (Continuation of Example 9.1) For the Gibbs sampler
(9.4), it can be shown (Problem 9.5) that $\mathrm{cov}(X^{(0)}, X^{(k)}) = \rho^{2k}$ for all $k$,
and

\[
\sigma^2_{\delta_0} / \sigma^2_{\delta_{rb}} = \frac{1}{\rho^2} > 1 .
\]

So, if $\rho$ is small, the amount of improvement, which is independent of the
number of iterations, can be substantial. ||
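The variance ratio of Example 9.20 can be approximated by simulation. The sketch below is a Monte Carlo illustration under assumptions made here: $h(x) = x$, the standard bivariate normal Gibbs sampler with conditionals $\mathcal{N}(\rho y, 1 - \rho^2)$ and $\mathcal{N}(\rho x, 1 - \rho^2)$, chains started at stationarity, and replication counts chosen only for speed. Function names are hypothetical.

```python
import numpy as np

def bivariate_normal_gibbs(rho, n_iter, rng, x0=0.0):
    """Two-stage Gibbs sampler for a standard bivariate normal with correlation rho."""
    sd = np.sqrt(1.0 - rho ** 2)
    x = x0
    xs = np.empty(n_iter)
    ys = np.empty(n_iter)
    for t in range(n_iter):
        y = rho * x + sd * rng.standard_normal()   # Y | X = x ~ N(rho*x, 1 - rho^2)
        x = rho * y + sd * rng.standard_normal()   # X | Y = y ~ N(rho*y, 1 - rho^2)
        xs[t], ys[t] = x, y
    return xs, ys

def compare_estimators(rho, n_iter, n_rep, rng):
    """Empirical variances of delta_0 and delta_rb over independent replications."""
    d0 = np.empty(n_rep)
    drb = np.empty(n_rep)
    for r in range(n_rep):
        xs, ys = bivariate_normal_gibbs(rho, n_iter, rng, x0=rng.standard_normal())
        d0[r] = xs.mean()                 # empirical average of h(X) = X
        drb[r] = (rho * ys).mean()        # average of E[X | Y] = rho * Y
    return d0.var(), drb.var()

rng = np.random.default_rng(6)
var_0, var_rb = compare_estimators(rho=0.5, n_iter=200, n_rep=400, rng=rng)
```

With $\rho = 0.5$ the ratio `var_0 / var_rb` should be in the vicinity of $1/\rho^2 = 4$, up to Monte Carlo error and finite-$T$ effects.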
9.4 The EM-Gibbs Connection

As mentioned earlier, the EM algorithm can be seen as a precursor of the
two-stage Gibbs sampler in missing data models (Section 5.3.1), in that it
similarly exploits the conditional distribution of the missing variables. The
connection goes further, as seen below.

Recall from Section 5.3.2 that, if $X \sim g(x|\theta)$ is the observed data, and we
augment the data with $z$, where $(X, Z) \sim f(x, z|\theta)$, then we have the complete-data
and incomplete-data likelihoods

\[
L^c(\theta|x, z) = f(x, z|\theta) \quad \text{and} \quad L(\theta|x) = g(x|\theta) ,
\]

with the missing data density

\[
k(z|x, \theta) = \frac{L^c(\theta|x, z)}{L(\theta|x)} .
\]
If we can normalize the complete-data likelihood in $\theta$ (this is the only condition
for the equivalence mentioned above), that is, if $\int L^c(\theta|x, z) \, d\theta < \infty$, then
define

\[
L^*(\theta|x, z) = \frac{L^c(\theta|x, z)}{\int L^c(\theta|x, z) \, d\theta}
\]

and create the two-stage Gibbs sampler:

\[
\begin{aligned}
&1. \quad z \,|\, \theta \sim k(z|x, \theta) \\
&2. \quad \theta \,|\, z \sim L^*(\theta|x, z).
\end{aligned} \tag{9.16}
\]

Note the direct connection to an EM algorithm based on $L^c$ and $k$. The “E”
step in the EM algorithm calculates the expected value of the log-likelihood
over $z$, often by calculating $\mathbb{E}(Z|x, \theta)$ and substituting in the log-likelihood.
In the Gibbs sampler this step is replaced with generating a random variable
from the density $k$. The “M” step of the EM algorithm then takes as the
current value of $\theta$ the maximum of the expected complete-data log-likelihood.
In the Gibbs sampler this step is replaced by generating a value of $\theta$ from $L^*$,
the normalized complete-data likelihood.

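Example 9.21. Censored data Gibbs. For the censored data example
considered in Example 5.14, the distribution of the missing data is

\[
Z_i \sim \frac{\varphi(z - \theta)}{1 - \Phi(a - \theta)} , \qquad z > a ,
\]

and the distribution of $\theta|x, z$ is

\[
L(\theta|x, z) \propto \prod_{i=1}^{m} e^{-(x_i - \theta)^2/2} \prod_{i=m+1}^{n} e^{-(z_i - \theta)^2/2} ,
\]

which corresponds to a

\[
\mathcal{N}\left( \frac{m \bar{x} + (n - m) \bar{z}}{n}, \; \frac{1}{n} \right)
\]

distribution, and so we immediately have that $L^*$ exists and that we can run
a Gibbs sampler (Problem 9.14). ||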
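A minimal sketch of the EM/Gibbs sampler (9.16) for this censored normal setting follows. It assumes unit variance, a known censoring point `a`, and hypothetical observations; the truncated normal draws use inversion of the cdf through the standard library class `statistics.NormalDist`. Function names and data are illustrative and not taken from Example 5.14.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def truncated_normal_above(theta, a, size, rng):
    """Draw from N(theta, 1) restricted to (a, infinity) by inverting the cdf."""
    lower = _STD_NORMAL.cdf(a - theta)
    u = rng.uniform(lower, 1.0, size=size)
    u = np.clip(u, 1e-12, 1.0 - 1e-12)       # keep inv_cdf arguments strictly inside (0, 1)
    return theta + np.array([_STD_NORMAL.inv_cdf(ui) for ui in u])

def censored_normal_gibbs(x_obs, n_censored, a, n_iter, rng):
    """Alternate z | theta (truncated normals) and theta | x, z (normal)."""
    x_obs = np.asarray(x_obs, dtype=float)
    n = len(x_obs) + n_censored
    theta = x_obs.mean()                      # starting value
    draws = np.empty(n_iter)
    for t in range(n_iter):
        # Step 1 of (9.16): simulate the censored observations, all larger than a
        z = truncated_normal_above(theta, a, n_censored, rng)
        # Step 2 of (9.16): theta | x, z ~ N((m*xbar + (n-m)*zbar)/n, 1/n)
        theta = rng.normal((x_obs.sum() + z.sum()) / n, 1.0 / np.sqrt(n))
        draws[t] = theta
    return draws

# Hypothetical data: six uncensored values and four values censored at a = 1
rng = np.random.default_rng(7)
x_obs = [0.2, -0.5, 0.9, 0.4, -0.1, 0.7]
draws = censored_normal_gibbs(x_obs, n_censored=4, a=1.0, n_iter=2000, rng=rng)
```

A histogram of `draws` approximates the normalized incomplete-data likelihood $L(\theta|x)$, whereas EM would return only its mode.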



The validity of the “EM/Gibbs” sampler follows in a straightforward manner
from its construction. The transition kernel of the Markov chain is

\[
K(\theta, \theta'|x) = \int_{\mathcal{Z}} k(z|x, \theta) \, L^*(\theta'|x, z) \, dz
\]

and it can be shown (Problem 9.15) that the invariant distribution of the
chain is the incomplete data likelihood, that is,

\[
L(\theta'|x) = \int_{\Theta} K(\theta, \theta'|x) \, L(\theta|x) \, d\theta .
\]

Since $L(\theta'|x, z)$ is integrable in $\theta'$, so is $L(\theta'|x)$, and hence the invariant
distribution is a proper density. So the Markov chain is positive, and convergence
follows from Theorem 9.6.

Fig. 9.3. Gibbs output for cellular phone data, 5000 iterations.

Example 9.22. Cellular phone Gibbs. As an illustration of the EM-Gibbs 
connection, we revisit Example 5.18, but now we use the Gibbs sampler to get 
our solution. From the complete data likelihood (5.18) and the missing data 
distribution (5.19) we have (Problem 9.16) 



p\WuW2,...,W5,J2X^ 

i 



P ( ITi + 1 , W2 + 1 , . . . , 14^5 + ^ - 1 ) 

% 



(9.17) 



^ ~ Meg I + m, 1 - 




The results of the Gibbs iterations are shown in Figure 9.3. The point
estimates agree with those of the EM algorithm (Example 5.18), $\hat{p}$ = (0.258,
0.313, 0.140, 0.118, 0.170), with the exception of $p_5$, which is larger than the
MLE. This may reflect the fact that the Gibbs estimate is a mean (and gets
pulled a bit into the tail), while the MLE is a mode. Measures of error can be
obtained from either the iterations or the histograms. ||



Based on the same functions $L(\theta|y, z)$ and $k(z|\theta, y)$ the EM algorithm
will get the ML estimator from $L(\theta|y)$, whereas the Gibbs sampler will get us
the entire function. This likelihood implementation of the Gibbs sampler was
used by Casella and Berger (1994) and is also described by Smith and Roberts
(1993). A version of the EM algorithm, where the Markov chain connection is
quite apparent, was given by Baum and Petrie (1966) and Baum et al. (1970).

9.5 Transition 

As a natural extension of the (2D) slice sampler, the two-stage Gibbs sampler
enjoys many optimality properties. In Chapter 8 we also developed the natural
extension of the slice sampler to the case when the density $f$ is easier to study
when decomposed as a product of functions $f_i$ (Section 8.2). Chapter 10 will
provide the corresponding Gibbs generalization to this general slice sampler,
which will cover cases when the decomposition of the simulation in two
conditional simulations is not feasible any longer. This generalization obviously
leads to a wider range of models, but also to fewer optimality properties.

As an introduction to the next chapter, note that if $(X, Y) = (X, (Y_1, Y_2))$,
and if simulating from $f_{Y|X}$ is not directly possible, the two-stage Gibbs
sampler could also be applied to this (conditional) density. This means that a
sequence of successive simulations from $f(y_1|x, y_2)$ and from $f(y_2|x, y_1)$ (which
is the translation of the two-stage Gibbs sampler for the conditional $f_{Y|X}$)
converges to a simulation from $f_{Y|X}(y_1, y_2|x)$.

The fundamental property used in Chapter 10 is that stopping the
recursion between $f(y_1|x, y_2)$ and $f(y_2|x, y_1)$ “before” convergence has no effect on
the validation of the algorithm. In fact, a single simulation of each component
is sufficient! Obviously, this feature will generalize (by a cascading argument)
to an arbitrary number of components in $Y$.



9.6 Problems 

9.1 For the Gibbs sampler [A.33]:

(a) Show that the sequence $(X_i, Y_i)$ is a Markov chain, as is each sequence $(X_i)$
and $(Y_i)$.

(b) Show that $f_X(\cdot)$ and $f_Y(\cdot)$ are respectively the invariant densities of the $X$
and $Y$ sequences of [A.33].

9.2 Write a Gibbs sampler to generate standard bivariate normal random variables
(with mean 0, variance 1 and correlation $\rho$). (Recall that if $(X, Y)$ is standard
bivariate normal, the conditional density of $X|Y = y$ is $\mathcal{N}(\rho y, (1 - \rho^2))$.) For
$\rho = .3$, use the generated random variables to estimate the density of $X^2 + Y^2$
and calculate $P(X^2 + Y^2 > 2)$.

9.3 Referring to Problem 5.18, estimate $p_A$, $p_B$ and $p_O$ using a Gibbs sampler. Make
a histogram of the samples.

9.4 In the case of the two-stage Gibbs sampler, the relationship between the Gibbs
and Metropolis-Hastings algorithms becomes particularly clear. If we have the
bivariate Gibbs sampler $X \sim f(x|y)$ and $Y \sim f(y|x)$, consider the $X$ chain
alone and show:

(a) $K(x, x') = g(x'|x) = \int f(x'|y) f(y|x) \, dy$;

(b) $\varrho = \min\left\{ \frac{f(x')/g(x'|x)}{f(x)/g(x|x')}, 1 \right\}$, where $f(\cdot)$ is the marginal distribution;

(c) $f(x')/g(x'|x) = f(x)/g(x|x')$, so $\varrho = 1$ and the Metropolis-Hastings
proposal is always accepted.

9.5 For the situation of Example 9.20:

(a) Verify the variance representations (9.14) and (9.15).

(b) For the bivariate normal sampler, show that $\mathrm{cov}(X_0, X_k) = \rho^{2k}$ for all $k$,
and $\sigma^2_{\delta_0} / \sigma^2_{\delta_{rb}} = 1/\rho^2$.

9.6 The monotone decrease of the correlation seen in Section 9.3 does not hold
uniformly for all Gibbs samplers as shown by the example of Liu et al. (1994):
For the bivariate normal Gibbs sampler of Example 9.20, and $h(x, y) = x - y$,
show that

\[
\mathrm{cov}[h(X_1, Y_1), h(X_2, Y_2)] = -\rho (1 - \rho)^2 < 0.
\]

9.7 Refer to the Horse Kick data of Table 2.1. Fit the loglinear model of Example 9.7
using the bivariate Gibbs sampler, along with the ARS algorithm, and estimate
both $\pi(a|x, y)$ and $\pi(b|x, y)$. Obtain both point estimates and error bounds for
a and b. Take == 5. 

9.8 The data of Example 9.7 can also be analyzed as a loglinear model, where we
fit $\log \lambda = a + bt$, where $t$ = number of passages.

(a) Using the techniques of Example 2.26, find the posterior distributions of $a$
and $b$. Compare your answer to that given in Example 9.7. (Ignore the “4 or
more”, and just use the category as 4.)

(b) Find the posterior distributions of $a$ and $b$, but now take the “4 or more”
censoring into account, as in Example 9.7.

9.9 The situation of Example 9.7 also lends itself to the EM algorithm, similar to
the Gibbs treatment (Example 9.8) and EM treatment (Example 5.21) of the
grouped multinomial data problem. For the data of Table 9.1:

(a) Use the EM algorithm to calculate the MLE of $\lambda$.

(b) Compare your answer in part (a) to that from the Gibbs sampler of
Algorithm [A.35].

(c) Establish that the Rao-Blackwellized estimator is correct.

9.10 In the setup of Example 9.8, the (uncompleted) posterior distribution is
available as

\[
\pi(\eta, \mu|x) \propto (a_1\mu + b_1)^{x_1} (a_2\mu + b_2)^{x_2} (a_3\eta + b_3)^{x_3} (a_4\eta + b_4)^{x_4}
(1 - \mu - \eta)^{x_5 + \alpha_3 - 1} \mu^{\alpha_1 - 1} \eta^{\alpha_2 - 1} .
\]

(a) Show that the marginal distributions $\pi(\mu|x)$ and $\pi(\eta|x)$ can be explicitly
computed as polynomials when the $\alpha_i$'s are integers.

(b) Give the marginal posterior distribution of $\xi = \mu/(1 - \eta - \mu)$. (Note: See
Robert 1995a for a solution.)

(c) Evaluate the Gibbs sampler proposed in Example 9.8 by comparing
approximate moments of $\mu$, $\eta$, and $\xi$ with their exact counterpart, derived from
the explicit marginal.

9.11 There is a connection between the EM algorithm and Gibbs sampling, in that
both have their basis in Markov chain theory. One way of seeing this is to show
that the incomplete-data likelihood is a solution to the integral equation of
successive substitution sampling and that Gibbs sampling can then be used to
calculate the likelihood function. If $L(\theta|y)$ is the incomplete-data likelihood and
$L(\theta|y, z)$ is the complete-data likelihood, define

\[
L^*(\theta|y) = \frac{L(\theta|y)}{\int L(\theta|y) \, d\theta} , \qquad
L^*(\theta|y, z) = \frac{L(\theta|y, z)}{\int L(\theta|y, z) \, d\theta} ,
\]

assuming both integrals to be finite.

(a) Show that $L^*(\theta|y)$ is the solution to

\[
L^*(\theta|y) = \int \left[ \int L^*(\theta|y, z) \, k(z|\theta', y) \, dz \right] L^*(\theta'|y) \, d\theta' ,
\]

where $k(z|\theta, y) = L(\theta|y, z)/L(\theta|y)$.

(b) Show that the sequence $\theta_{(j)}$ from the Gibbs iteration

\[
\theta_{(j)} \sim L^*(\theta|y, z_{(j-1)}) , \qquad z_{(j)} \sim k(z|\theta_{(j)}, y) ,
\]

converges to a random variable with density $L^*(\theta|y)$ as $j$ goes to $\infty$. How can
this be used to compute the likelihood function $L(\theta|y)$?
(Note: Based on the same functions $L(\theta|y, z)$ and $k(z|\theta, y)$ the EM algorithm
will get the ML estimator from $L(\theta|y)$, whereas the Gibbs sampler will get us
the entire function. This likelihood implementation of the Gibbs sampler was
used by Casella and Berger 1994 and is also described by Smith and Roberts
1993. A version of the EM algorithm, where the Markov chain connection is
quite apparent, was given by Baum and Petrie 1966 and Baum et al. 1970.)

9.12 In the setup of Example 9.9, the posterior distribution of $N$ can be evaluated
by recursion.

(a) Show that

\[
\pi(N) \propto \frac{(N - n_0)! \, \lambda^N}{N! \, (N - n_t)!} .
\]

(b) Using the ratio $\pi(N)/\pi(N - 1)$, derive a recursion relation to compute
$\mathbb{E}^{\pi}[N|n_0, n_t]$.

(c) In the case $n_0 = 112$, $n_t = 79$, and $\lambda = 500$, compare the computation time
of the above device with the computation time of the Gibbs sampler. (Note:
See George and Robert 1992 for details.)

9.13 Recall that, in the setting of Example 5.22, animal $i$, $i = 1, 2, \ldots, n$, may be 
captured at time $j$, $j = 1, 2, \ldots, t$, in one of $m$ locations, where the location is 
a multinomial random variable $H_{ij} \sim \mathcal{M}_m(\theta_1,\ldots,\theta_m)$. Given $H_{ij} = k$ 
$(k = 1,2,\ldots,m)$, the animal is captured with probability $p_k$, represented by the 
random variable $X \sim \mathcal{B}(p_k)$. Define $y_{ijk} = \mathbb{I}(h_{ij} = k)\,\mathbb{I}(x_{ijk} = 1)$. 

(a) Under the conjugate priors 

$$(\theta_1,\ldots,\theta_m) \sim \mathcal{D}(\lambda_1,\ldots,\lambda_m) \quad\text{and}\quad p_k \sim \mathcal{B}e(\alpha,\beta),$$

show that the full conditional posterior distributions are given by 

$$\{\theta_1,\ldots,\theta_m\} \sim \mathcal{D}\Big(\lambda_1 + \sum_{i=1}^n\sum_{j=1}^t y_{ij1},\ \ldots,\ \lambda_m + \sum_{i=1}^n\sum_{j=1}^t y_{ijm}\Big)$$

and 

$$p_k \sim \mathcal{B}e\Big(\alpha + \sum_{i=1}^n\sum_{j=1}^t x_{ijk},\ \beta + n - \sum_{i=1}^n\sum_{j=1}^t x_{ijk}\Big).$$

(b) Deduce that all of the full conditionals are conjugate and thus that the 
Gibbs sampler is straightforward to implement. 

(c) For the data of Problem 5.28, estimate the $\theta_i$'s and the $p_i$'s using the Gibbs 
sampler starting from the prior parameter values $\alpha = \beta = 5$ and $\lambda_i = 2$. 

(Note: Dupuis 1995 and Scherrer 1997 discuss how to choose the prior parameter 
values to reflect the anticipated movement of the animals.) 

9.14 Referring to Example 9.21: 

(a) Show that, as a function of $\theta$, the normalized complete-data likelihood is 
$\mathcal{N}\big((m\bar{x} + (n-m)\bar{z})/n,\ 1/n\big)$. 

(b) Derive a Monte Carlo EM algorithm to estimate $\theta$. 

(c) Contrast the Gibbs sampler algorithm with the EM algorithm of Example 
9.21 and the Monte Carlo EM algorithm of part (b). 

9.15 Referring to Section 9.4: 

(a) Show that the Gibbs sampler of (9.16) has $L(\theta|x)$ as stationary distribution. 

(b) Show that if $L(\theta|x,z)$ is integrable in $\theta$, then so is $L(\theta|x)$, and hence the 
Markov chain of part (a) is positive. 

(c) Complete the proof of ergodicity of the Markov chain. (Hint: In addition 
to Theorem 10.10, see Theorem 6.51. Theorem 9.12 may also be useful in 
some situations.) 

9.16 Referring to Example 9.22: 

(a) Verify the distributions in (9.17) for the Gibbs sampler. 

(b) Compare the output of the Gibbs sampler to the EM algorithm of Example 
5.18. Which algorithm do you prefer and why? 

9.17 (Smith and Gelfand 1992) For $i = 1, 2, 3$, consider $Y_i = X_{1i} + X_{2i}$, with 

$$X_{1i} \sim \mathcal{B}(n_{1i},\theta_1), \qquad X_{2i} \sim \mathcal{B}(n_{2i},\theta_2).$$

(1) Give the likelihood $L(\theta_1,\theta_2)$ for $n_{1i} = 5, 6, 4$, $n_{2i} = 5, 4, 6$, and $y_i = 7, 5, 6$. 

(2) For a uniform prior on $(\theta_1,\theta_2)$, derive the Gibbs sampler based on the 
natural parameterization. 

(3) Examine whether an alternative parameterization or a Metropolis-Hastings 
algorithm may speed up convergence. 

9.18 For the Gibbs sampler 

$$X\,|\,y \sim \mathcal{N}(\rho y,\ 1-\rho^2), \qquad Y\,|\,x \sim \mathcal{N}(\rho x,\ 1-\rho^2),$$

of Example 9.1: 

(a) Show that for the $X$ chain, the transition kernel is 

$$K(x^*,x) = \frac{1}{2\pi(1-\rho^2)} \int \exp\left\{-\frac{(x-\rho y)^2}{2(1-\rho^2)} - \frac{(y-\rho x^*)^2}{2(1-\rho^2)}\right\} dy.$$

(b) Show that $X \sim \mathcal{N}(0,1)$ is the invariant distribution of the $X$ chain. 

(c) Show that $X|x^* \sim \mathcal{N}(\rho^2 x^*,\ 1-\rho^4)$. (Hint: Complete the square in the 
exponent of part (a).) 

(d) Show that we can write $X_k = \rho^2 X_{k-1} + U_k$, $k = 1, 2, \ldots$, where the $U_k$ 
are iid $\mathcal{N}(0,\ 1-\rho^4)$ and that $\mathrm{cov}(X_0,X_k) = \rho^{2k}$ for all $k$. Deduce that the 
covariances go to zero. 

9.19 In the setup of Example 9.23, show that the likelihood function is not bounded 
and deduce that, formally, there is no maximum likelihood estimator. (Hint: 
Take $\mu_{j_0} = x_{i_0}$ and let $\tau_{j_0}$ go to 0.) 

9.20 A model consists in the partial observation of normal vectors $Z = (X,Y) \sim 
\mathcal{N}_2(0,\Sigma)$ according to a mechanism of random censoring. The corresponding 
data are given in Table 9.2. 

  x    1.17   -0.98    0.18    0.57    0.21     -       -       -
  y    0.34   -1.24   -0.13     -       -     -0.12   -0.83    1.64

Table 9.2. Independent observations of $Z = (X,Y) \sim \mathcal{N}_2(0,\Sigma)$ with missing data 
(denoted -). 

(a) Show that inference can formally be based on the likelihood 

$$\prod_{i=1}^{3}\left\{|\Sigma|^{-1/2}\,e^{-z_i^t\Sigma^{-1}z_i/2}\right\}\ \sigma_1^{-2}\,e^{-(x_4^2+x_5^2)/2\sigma_1^2}\ \sigma_2^{-3}\,e^{-(y_6^2+y_7^2+y_8^2)/2\sigma_2^2}.$$

(b) Show that the choice of the prior distribution $\pi(\Sigma) \propto |\Sigma|^{-1}$ leads to 
difficulties given that $\sigma_1$ and $\sigma_2$ are isolated in the likelihood. 

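(c) Show that the missing components can be simulated through the following 
algorithm 

Algorithm A.37 -Normal Completion- 

1. Simulate 

$$X_i \sim \mathcal{N}\Big(\rho\frac{\sigma_1}{\sigma_2}\,y_i,\ \sigma_1^2(1-\rho^2)\Big) \quad (i = 6,7,8),$$
$$Y_i \sim \mathcal{N}\Big(\rho\frac{\sigma_2}{\sigma_1}\,x_i,\ \sigma_2^2(1-\rho^2)\Big) \quad (i = 4,5).$$

2. Generate 

$$\Sigma^{-1} \sim \mathcal{W}_2(8,\ X^{-1}),$$

with $X = \sum_{i=1}^{8} z_i z_i^t$ the dispersion matrix of the completed data, 

to derive the posterior distribution of the quantity of interest, $\rho$. 

(d) Propose a Metropolis-Hastings alternative based on a slice sampler. 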

9.21 Roberts and Rosenthal (2003) derive the polar slice sampler from the 
decomposition $\pi(x) \propto f_0(x)\,f_1(x)$ of a target distribution $\pi(x)$. 

(a) Show that the Gibbs sampler whose two steps are (i) to simulate $U$ uniformly 
on $(0, f_1(x))$ and (ii) to simulate $X$ from $f_0(x)$ restricted to the set of 
$x$'s such that $f_1(x) > u$ is a valid MCMC algorithm with target distribution 
$\pi$. 

(b) Show that, if $f_0$ is constant, this algorithm is simply the slice sampler of 
Algorithm [A.31] in Chapter 8. (It is also called the uniform slice sampler 
in Roberts and Rosenthal 2003.) 

(c) When $x \in \mathbb{R}^d$, using the case $f_0(x) = |x|^{-(d-1)}$ and $f_1(x) = |x|^{d-1}\,\pi(x)$ is 
called the polar slice sampler. Show that using the polar slice sampler on 
a $d$-dimensional log-concave target density is equivalent to a uniform slice 
sampler on a corresponding one-dimensional log-concave density. 

(d) Illustrate this property in the case when $\pi(x) = \exp(-|x|)$. 

9.22 Consider the case of a mixture of normal distributions, 

$$f(x) = \sum_{j=1}^{k} p_j\,\frac{e^{-(x-\mu_j)^2/(2\tau_j^2)}}{\sqrt{2\pi}\,\tau_j}.$$

(a) Show that the conjugate distribution on $(\mu_j,\tau_j)$ is 







(b) Show that two valid steps of the Gibbs sampler are as follows. 

Algorithm A.38 -Normal Mixture Posterior Simulation- 
1. Simulate $(i = 1, \ldots, n)$ 

Zi ~ P(Zi =j)(xpj exp {-{si - 

and compute the statistics {j = 1, . . . , fc) [A. 38] 

n n n 



2. Generate 



\ ^ j H" H" ^ j J 

^ Aj + nj +3 Ajg| - (A, + +njXjf '^ 



rf 



p ^ - 1 + nk) 



(Note: Robert and Soubiran 1993 use this algorithm to derive the maximum 
likelihood estimators by recursive integration (see Section 5.2.4), showing 
that the Bayes estimators converge to the local maximum, which is the 
closest to the initial Bayes estimator.) 

9.23 Referring to Section 9.7.2, for the factor ARCH model of (9.18): 

(a) Propose a noninformative prior distribution on the parameter $\theta = (\alpha,\beta,a,\Sigma)$ 
that leads to a proper posterior distribution. 

(b) Propose a completion step for the latent variables based on $f(y_t^*|y_t,y_{t-1}^*,\theta)$. 
(Note: See Diebold and Nerlove 1989, Gourieroux et al. 1993, Kim et al. 1998, 
and Billio et al. 1998 for different estimation approaches to this model.) 

9.24 Check whether a negative coefficient $b$ in the random walk $Y_t = a + b(X^{(t)} - a) + Z_t$ 
induces a negative correlation between the $X^{(t)}$'s. Extend to the case 
where the random walk has an ARCH structure, 

$$Y_t = a + b(X^{(t)} - a) + \exp\big(c + d(X^{(t)} - a)^2\big)\,Z_t.$$








9.25 (Diebolt and Robert 1990a, b) 

(a) Show that, for a mixture of distributions from exponential families, there 
exist conjugate priors. (Hint: See Example 9.2.) 

(b) For the conjugate priors, show that the posterior expectation of the mean 
parameters of the components can be written in a closed form. 

(c) Show that the convergence to stationarity is geometric for all the chains 
involved in the Gibbs sampler for the mixture model. 

(d) Show that Rao-Blackwellization applies in the setup of normal mixture 
models and that it theoretically improves upon the naive average. (Hint: 
Use the Duality Principle.) 

9.26 (Roeder and Wasserman 1997) In the setup of normal mixtures of Example 

9.2: 

(a) Derive the posterior distribution associated with the prior 7r(/i, r), where 
the Tj^’s are inverted Gamma XQ{v^ A) and 7r(/Xj|/ij_i, r) is a left-truncated 
normal distribution 

except for 7r(/ii) = l/^i. Assume that the constant B is known and 7v{A) = 

1/A. 

(b) Show that the posterior is always proper. 

(c) Derive the posterior distribution using the noninformative prior 

7r(/u,r) = p,qj ~ W[o,i). ~ V(0,C^) . 

and compare. 

9.27 Consider the following mixture of uniforms,^ 

$$p\,\mathcal{U}_{[\lambda,\lambda+1]} + (1-p)\,\mathcal{U}_{[\mu,\mu+1]},$$

and an ordered sample $x_1 \le \cdots \le x_m \le \cdots \le x_n$ such that $x_m + 1 < x_{m+1}$. 

(a) Show that the chain associated with the Gibbs sampler corre- 

sponding to [A. 34] is not irreducible. 

(b) Show that the above problem disappears if is replaced with 

(for n large enough). 

9.28 (Billio et al. 1999) A dynamic disequilibrium model is defined as the 
observation of 

$$Y_t = \min(Y_{1t}^*, Y_{2t}^*),$$

where the $Y_{it}^*$ are distributed from a parametric joint model, $f(y_{1t}^*, y_{2t}^*)$. 

(a) Give the distribution of $(Y_{1t}^*, Y_{2t}^*)$ conditional on $Y_t$. 

(b) Show that a possible completion of the model is to first draw the regime (1 
versus 2) and then draw the missing component. 

(c) Show that when $f(y_{1t}^*, y_{2t}^*)$ is Gaussian, the above steps can be implemented 
without approximation. 

^ This problem was suggested by Eric Moulines. 

9.7 Notes 

9.7.1 Inference for Mixtures 

Although they may seem to apply only for some very particular sets of random 
phenomena, mixtures of distributions (9.5) are of wide use in practical modeling. 
However, as already noticed in Examples 1.2 and 3.7, they can be challenging from 
an inferential point of view (that is, when estimating the parameters $p_j$ and $\xi_j$). 
Everitt (1984), Titterington et al. (1985), MacLachlan and Basford (1988), West 
(1992), Titterington (1996), Robert (1996a), and Marin et al. (2004) all provide 
different perspectives on mixtures of distributions, discuss their relevance for modeling 
purposes, and give illustrations of their use in various setups of (9.5). 

We assume, without a considerable loss of generality, that $f(\cdot|\xi)$ belongs to an 
exponential family 

$$f(x|\xi) = h(x)\exp\{\xi\cdot x - \psi(\xi)\},$$

and we consider the associated conjugate prior on $\xi$ (see Robert 2001, Section 3.3) 

$$\pi(\xi|\alpha_0,\lambda) \propto \exp\{\lambda(\xi\cdot\alpha_0 - \psi(\xi))\}, \qquad \lambda > 0,\ \alpha_0 \in \mathcal{X}.$$

For the mixture (9.5), it is therefore possible to associate with each component 
$f(\cdot|\xi_j)$ $(j = 1,\ldots,k)$ a conjugate prior $\pi(\xi_j|\alpha_j,\lambda_j)$. We also select for $(p_1,\ldots,p_k)$ 
the standard Dirichlet conjugate prior; that is, 

$$(p_1,\ldots,p_k) \sim \mathcal{D}_k(\gamma_1,\ldots,\gamma_k).$$

Given a sample $(x_1,\ldots,x_n)$ from (9.5), and conjugate priors on $\xi$ and $(p_1,\ldots,p_k)$ 
(see Robert 2001, Section 3.3), the posterior distribution associated with this model 
is formally explicit (see Problem 9.25). However, it is virtually useless for large, or 
even moderate, values of $n$. In fact, the posterior distribution, 

$$\pi(p,\xi|x_1,\ldots,x_n) \propto \prod_{i=1}^{n}\Big\{\sum_{j=1}^{k} p_j\,f(x_i|\xi_j)\Big\}\,\pi(p,\xi),$$

is better expressed as a sum of $k^n$ terms which correspond to the different allocations 
of the observations $x_i$ to the components of (9.5). Although each term is conjugate, 
the number of terms involved in the posterior distribution makes the computation 
of the normalizing constant and of posterior expectations totally infeasible for large 
sample sizes (see Diebolt and Robert 1990a). (In a simulation experiment, Casella 
et al. 1999 actually noticed that very few of these terms carry a significant posterior 
weight, but there is no manageable approach to determine which terms are relevant 
and which are not.) The complexity of this model is such that there are virtually no 
other solutions than using the Gibbs sampler (see, for instance, Smith and Makov 
1978 or Bernardo and Giron 1986, 1988, for pre-Gibbs approximations). 

The solution proposed by Diebolt and Robert (1990c,b, 1994), Lavine and West 
(1992), Verdinelli and Wasserman (1992), and Escobar and West (1995) is to take 
advantage of the missing data structure inherent to (9.5), as in Example 9.2. 

Good performance of the Gibbs sampler is guaranteed by the above setup since 
the Duality Principle of Section 9.2.3 applies. One can also deduce geometric con- 
vergence and a Central Limit Theorem. Moreover, Rao-Blackwellization is justified 
(see Problem 9.25). 

The practical implementation of [A.34] might, however, face serious convergence 
difficulties, in particular because of the phenomenon of the "absorbing component" 
(Diebolt and Robert 1990b, Mengersen and Robert 1996, Robert 1996a). When only 
a small number of observations are allocated to a given component $j_0$, the following 
probabilities are quite small: 

(1) The probability of allocating new observations to the component $j_0$. 

(2) The probability of reallocating, to another component, observations already 
allocated to $j_0$. 

Even though the chain corresponding to [A.34] is irreducible, the practical 
setting is one of an almost-absorbing state, which is called a trapping state as it 
requires an enormous number of iterations in [A.34] to escape from this state. In 
the extreme case, the probability of escape is below the minimal precision of the 
computer and the trapping state is truly absorbing, due to computer "rounding 
errors." 

This problem can be linked with a potential difficulty of this model, namely that 
it does not allow a noninformative (or improper) Bayesian approach, and therefore 
necessitates the elicitation of the hyperparameters $\gamma_j$, $\alpha_j$, and $\lambda_j$. Moreover, a vague 
choice of these parameters (taking, for instance, $\gamma_j = 1/2$, $\alpha_j = \alpha_0$, and small $\lambda_j$'s) 
often has the effect of increasing the occurrence of trapping states (Chib 1995). 

Example 9.23. (Continuation of Example 9.2) Consider the case where Xj 
1, Oj = 0 (j = 1, . . . , /c), and where a single observation XiQ is allocated to jo. Using 
the algorithm [A. 38], we get the approximation 

MjoIU'o (^^0 5 Uo ) ’ Uo ^ l‘^) 5 

so ^ fjj/ 4 1 and Xi^. Therefore, 

P{Zio =jo) (xP{Zia =ji) 

for $j_1 \ne j_0$. On the other hand, if $i_1 \ne i_0$, it follows that 

P(Zii =jo) OC Pjo 

< Pji oiP{Zi^ = ji) , 

given the very rapid decrease of $\exp\{-t^2/2\tau_{j_0}^2\}$. || 

An attempt at resolution of the paradox of trapping states may be to blame 
the Gibbs sampler, which moves too slowly on the likelihood surface, and to replace 
it by a Metropolis-Hastings algorithm with wider moves, as developed by Celeux 
et al. (2000). The trapping phenomenon is also related to the lack of a maximum 
likelihood estimator in this setup, since the likelihood is not bounded. (See Problem 
9.19 or Lehmann and Casella 1998.) 

9.7.2 ARCH Models 

Together with the techniques of Section 5.5.4 and Chapter 7, the Gibbs sampler can 
be implemented for estimation in an interesting class of missing data models that 
are used in Econometrics and Finance. 

A Gaussian ARCH (Auto Regressive Conditionally Heteroscedastic) model is 
defined, for $t = 2, \ldots, T$, by 

$$\begin{cases} Z_t = (\alpha + \beta Z_{t-1}^2)^{1/2}\,\varepsilon_t^*, \\ X_t = aZ_t + \varepsilon_t, \end{cases} \tag{9.18}$$

where $a \in \mathbb{R}^p$, $\varepsilon_t^* \sim \mathcal{N}(0,1)$, and $\varepsilon_t \sim \mathcal{N}_p(0,\Sigma)$ independently. Let $\theta = (a,\beta,\sigma)$ 
denote the parameter of the model and assume, in addition, that $Z_1 \sim \mathcal{N}(0,1)$. The 
sample is denoted by $x^T = (x_1,\ldots,x_T)$, meaning that the $z_t$'s are not observed. (For 
details on the theory and use of ARCH models, see the review papers by Bollerslev 
et al. 1992 and Kim et al. 1998, or the books by Enders 1994 and Gourieroux 1996.) 
For simplicity and identifiability reasons, we consider the special case 

$$\begin{cases} Z_t = (1 + \beta Z_{t-1}^2)^{1/2}\,\varepsilon_t^*, & \varepsilon_t^* \sim \mathcal{N}(0,1), \\ X_t = aZ_t + \varepsilon_t, & \varepsilon_t \sim \mathcal{N}_p(0,\sigma^2 I_p). \end{cases} \tag{9.19}$$

Of course, the difficulty in estimation under this model comes from the fact that 
only the $x_t$'s are observed. 

However, in missing data models such as this one (see also Billio et al. 1998), 
the likelihood function $L(\theta|x)$ can be written as the marginal of $f(x,z|\theta)$ and the 
likelihood ratio is thus available in the form 

$$\frac{L(\theta|x)}{L(\eta|x)} = \mathbb{E}_\eta\left[\frac{f(x,Z|\theta)}{f(x,Z|\eta)}\,\Big|\,x\right]$$

and can be approximated by 

$$\frac{1}{m}\sum_{i=1}^{m}\frac{f(x,z_i|\theta)}{f(x,z_i|\eta)}, \qquad z_i \sim f(z|x,\eta). \tag{9.20}$$

(See Section 5.5.4 and Problem 5.11.) 
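To make the approximation (9.20) concrete, here is a minimal Python sketch on a toy missing data model rather than on the ARCH model itself. The model is an assumption chosen only because everything is available in closed form: $Z \sim \mathcal{N}(\theta,1)$ is unobserved and $X|Z = z \sim \mathcal{N}(z,1)$, so that $Z|x,\eta \sim \mathcal{N}((x+\eta)/2, 1/2)$ can be simulated exactly and the true ratio follows from $X \sim \mathcal{N}(\theta,2)$. The function names are hypothetical.

```python
import math
import random


def joint_density(x, z, theta):
    """Complete-data density f(x, z | theta) for the toy model
    Z ~ N(theta, 1), X | Z = z ~ N(z, 1) (an illustrative assumption)."""
    return math.exp(-0.5 * (z - theta) ** 2 - 0.5 * (x - z) ** 2) / (2.0 * math.pi)


def likelihood_ratio_estimate(x, theta, eta, m, rng):
    """Monte Carlo average of f(x, z_i | theta) / f(x, z_i | eta),
    with the z_i simulated from f(z | x, eta), as in (9.20)."""
    total = 0.0
    for _ in range(m):
        # In this toy model, Z | x, eta ~ N((x + eta) / 2, 1/2) exactly.
        z = rng.gauss(0.5 * (x + eta), math.sqrt(0.5))
        total += joint_density(x, z, theta) / joint_density(x, z, eta)
    return total / m


def exact_ratio(x, theta, eta):
    """Closed-form L(theta | x) / L(eta | x), since marginally X ~ N(theta, 2)."""
    return math.exp(((x - eta) ** 2 - (x - theta) ** 2) / 4.0)


rng = random.Random(0)
estimate = likelihood_ratio_estimate(x=1.0, theta=0.5, eta=0.0, m=20000, rng=rng)
exact = exact_ratio(x=1.0, theta=0.5, eta=0.0)
print(estimate, exact)
```

In the ARCH setting the exact simulation step would be replaced by an MCMC sample from $f(z|x,\eta)$, and the quality of the approximation degrades as $\theta$ moves away from $\eta$, which is consistent with the sensitivity to $\eta$ noted in the text.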





              True         Starting       Approximate maximum 
              $\theta$     value $\eta$   likelihood estimate $\hat\theta$ 
  $a_1$       -0.2         -0.153         -0.14 
  $a_2$        0.6          0.43           0.42 
  $\beta$      0.8          0.86           0.99 
  $\sigma^2$   0.2          0.19           0.2 

Table 9.3. Estimation result for the factor ARCH model (9.19) with a simulated 
sample of size $T = 100$ and a Bayes estimate as starting value. (Source: Billio et al. 
1998.) 



The approximation of the likelihood ratio (9.20) is then based on the simulation 
of the missing data $Z^T = (Z_1,\ldots,Z_T)$ from 

$$f(z^T|x^T,\theta) \propto f(z^T,x^T|\theta) \propto \sigma^{-pT}\exp\Big\{-\sum_{t=1}^{T}\|x_t - a z_t\|^2/2\sigma^2\Big\}\,e^{-z_1^2/2}\prod_{t=2}^{T}\frac{e^{-z_t^2/2(1+\beta z_{t-1}^2)}}{(1+\beta z_{t-1}^2)^{1/2}}, \tag{9.21}$$

whose implementation using a Metropolis-Hastings algorithm (see Chapter 7) is 
detailed in Problem 9.24. Given a sample $(z_1^T,\ldots,z_m^T)$ simulated from (9.21), with 
$\theta = \eta$, and an observed sample $x^T$, the approximation (9.20) is given by 

$$\frac{1}{m}\sum_{i=1}^{m}\frac{f(z_i^T,x^T|\theta)}{f(z_i^T,x^T|\eta)}, \tag{9.22}$$

where $f(z^T,x^T|\theta)$ is defined in (9.21). 

Billio et al. (1998) consider a simulated sample of size $T = 100$ with $p = 2$, 
$a = (-0.2, 0.6)$, $\beta = 0.8$, and $\sigma^2 = 0.2$. The above approximation method is quite 
sensitive to the value of $\eta$ and a good choice is the noninformative Bayes estimate 
associated with the prior $\pi(a,\beta,\sigma) = 1/\sigma$, which can be obtained by a 
Metropolis-Hastings algorithm (see Problem 9.24). Table 9.3 gives the result of the maximization 
of (9.22) for $m = 50{,}000$. The maximum likelihood estimator $\hat\beta$ is far from the true 
value $\beta$, but the estimated log-likelihood ratio at $(\hat a,\hat\beta,\hat\sigma)$ is 0.348, which indicates 
that the likelihood is rather flat in this region. 

The Multi-Stage Gibbs Sampler 



In this place he was found by Gibbs, who had been sent for him in some 
haste. He got to his feet with promptitude, for he knew no small matter 
would have brought Gibbs in such a place at all. 

— G.K. Chesterton, The Innocence of Father Brown 



After two chapters of preparation on the slice and two-stage Gibbs samplers, 
respectively, we are now ready to envision the entire picture for the Gibbs 
sampler. We describe the general method in Section 10.1, whose theoretical 
properties are less complete than for the two-stage special case (see Section 
10.2): The defining difference between that sampler and the multi-stage ver- 
sion considered here is that the interleaving structure of the two-stage chain 
does not carry over. Some of the consequences of interleaving are the fact that 
the individual subchains are also Markov chains, and the Duality Principle 
and Rao-Blackwellization hold in some generality. None of that is true here, 
in the multi-stage case. Nevertheless, the multi-stage Gibbs sampler enjoys 
many optimality properties, and still might be considered the workhorse of 
the MCMC world. The remainder of this chapter deals with implementation 
considerations, many in connection with the important role of the Gibbs sam- 
pler in Bayesian Statistics. 



10.1 Basic Derivations 

10.1.1 Definition 

The following definition is a rather natural extension of what we saw in Section 
8.2 for the general slice sampler. Suppose that for some $p > 1$, the random 
variable $X \in \mathcal{X}$ can be written as $X = (X_1,\ldots,X_p)$, where the $X_i$'s are 
either uni- or multidimensional. Moreover, suppose that we can simulate from 
the corresponding univariate conditional densities $f_1,\ldots,f_p$, that is, we can 
simulate 

$$X_i\,|\,x_1,x_2,\ldots,x_{i-1},x_{i+1},\ldots,x_p \sim f_i(x_i|x_1,x_2,\ldots,x_{i-1},x_{i+1},\ldots,x_p)$$

for $i = 1, 2, \ldots, p$. The associated Gibbs sampling algorithm (or Gibbs sampler) 
is given by the following transition from $X^{(t)}$ to $X^{(t+1)}$: 

Algorithm A.39 -The Gibbs Sampler- 

Given $x^{(t)} = (x_1^{(t)},\ldots,x_p^{(t)})$, generate 

1. $X_1^{(t+1)} \sim f_1(x_1|x_2^{(t)},\ldots,x_p^{(t)})$; 
2. $X_2^{(t+1)} \sim f_2(x_2|x_1^{(t+1)},x_3^{(t)},\ldots,x_p^{(t)})$;     [A.39] 
   ... 
p. $X_p^{(t+1)} \sim f_p(x_p|x_1^{(t+1)},\ldots,x_{p-1}^{(t+1)})$. 


The densities $f_1,\ldots,f_p$ are called the full conditionals, and a particular 
feature of the Gibbs sampler is that these are the only densities used for 
simulation. Thus, even in a high-dimensional problem, all of the simulations 
may be univariate, which is usually an advantage. 

Example 10.1. Autoexponential model. The autoexponential model of 
Besag (1974) has been found useful in some aspects of spatial modeling. When 
$y \in \mathbb{R}_+^3$, the corresponding density is 

$$f(y_1,y_2,y_3) \propto \exp\{-(y_1 + y_2 + y_3 + \theta_{12}y_1y_2 + \theta_{23}y_2y_3 + \theta_{31}y_3y_1)\},$$

with known $\theta_{ij} > 0$. The full conditional densities are exponential. For 
example, 

$$Y_3\,|\,y_1,y_2 \sim \mathcal{E}xp(1 + \theta_{23}y_2 + \theta_{31}y_1).$$

They are thus very easy to simulate from. In contrast, the other conditionals 
and the marginal distributions have forms such as 

$$f(y_2|y_1) \propto \frac{\exp\{-(y_1 + y_2 + \theta_{12}y_1y_2)\}}{1 + \theta_{23}y_2 + \theta_{31}y_1},$$

$$f(y_1) \propto e^{-y_1}\int_0^{+\infty}\frac{\exp\{-y_2 - \theta_{12}y_1y_2\}}{1 + \theta_{23}y_2 + \theta_{31}y_1}\,dy_2,$$

which cannot be simulated easily. 
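As an illustration of [A.39] on the autoexponential model, the following Python sketch cycles through the three exponential full conditionals. It is only a sketch under stated assumptions: the function name, the starting point, and the values $\theta_{12} = \theta_{23} = \theta_{31} = 1$ used in the run are hypothetical choices, not values taken from the text.

```python
import random


def autoexponential_gibbs(theta12, theta23, theta31, n_iter, rng,
                          start=(1.0, 1.0, 1.0)):
    """Systematic-scan Gibbs sampler for
    f(y1, y2, y3) proportional to
    exp{-(y1 + y2 + y3 + theta12*y1*y2 + theta23*y2*y3 + theta31*y3*y1)}.
    Each full conditional is exponential; expovariate takes the rate."""
    y1, y2, y3 = start
    chain = []
    for _ in range(n_iter):
        y1 = rng.expovariate(1.0 + theta12 * y2 + theta31 * y3)
        y2 = rng.expovariate(1.0 + theta12 * y1 + theta23 * y3)
        y3 = rng.expovariate(1.0 + theta23 * y2 + theta31 * y1)
        chain.append((y1, y2, y3))
    return chain


rng = random.Random(0)
chain = autoexponential_gibbs(1.0, 1.0, 1.0, n_iter=20000, rng=rng)
mean_y1 = sum(y[0] for y in chain) / len(chain)
print(mean_y1)
```

Since the marginal of $y_1$ is $e^{-y_1}$ multiplied by a factor that decreases in $y_1$, the empirical mean of the first component should fall below 1, which gives a quick sanity check on the output.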
Example 10.2. Ising model. For the Ising model of Example 5.8, where 



$$f(s) \propto \exp\Big\{-H\sum_i s_i - J\sum_{(i,j)\in\mathcal{N}} s_i s_j\Big\}, \qquad s_i \in \{-1,1\},$$

and where $\mathcal{N}$ denotes the neighborhood relation for the network, the full 
conditionals are given by 

$$f(s_i|s_{j\ne i}) = \frac{\exp\{-Hs_i - Js_i\sum_{j:(i,j)\in\mathcal{N}} s_j\}}{\exp\{-H - J\sum_{j:(i,j)\in\mathcal{N}} s_j\} + \exp\{H + J\sum_{j:(i,j)\in\mathcal{N}} s_j\}} 
= \frac{\exp\{-(H + J\sum_{j:(i,j)\in\mathcal{N}} s_j)(s_i + 1)\}}{1 + \exp\{-2(H + J\sum_{j:(i,j)\in\mathcal{N}} s_j)\}}.$$

It is therefore particularly easy to implement the Gibbs sampler [A.39] for 
these conditional distributions by successively updating each node $i$ of the 
network. (See Swendson and Wang 1987 and Swendson et al. 1992 for improved 
algorithms in this case, as introduced in Problem 7.43.) || 
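The node-by-node update can be sketched in a few lines of Python. The sketch assumes a square $n \times n$ grid with four nearest neighbors and periodic boundaries, which is an illustrative choice and not something the example specifies; the function name is hypothetical. With the sign convention above, the full conditional gives $P(s_i = 1 \mid \text{neighbors}) = 1/(1 + \exp\{2(H + J\sum_j s_j)\})$.

```python
import math
import random


def ising_gibbs_sweep(spins, H, J, rng):
    """One systematic sweep of single-site Gibbs updates on a square grid.
    Periodic boundaries and 4 nearest neighbors are assumptions of this
    sketch; spins is a list of lists with entries -1 or +1."""
    n = len(spins)
    for i in range(n):
        for j in range(n):
            neighbor_sum = (spins[(i - 1) % n][j] + spins[(i + 1) % n][j]
                            + spins[i][(j - 1) % n] + spins[i][(j + 1) % n])
            field = H + J * neighbor_sum
            # Full conditional: P(s = +1 | neighbors) = 1 / (1 + exp(2 * field)).
            p_plus = 1.0 / (1.0 + math.exp(2.0 * field))
            spins[i][j] = 1 if rng.random() < p_plus else -1
    return spins


rng = random.Random(0)
n = 8
spins = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
for _ in range(5):
    ising_gibbs_sweep(spins, H=5.0, J=0.0, rng=rng)
mean_spin = sum(sum(row) for row in spins) / (n * n)
print(mean_spin)
```

With $J = 0$ and a large positive $H$, the density favors $s_i = -1$ at every node, so the average spin after a few sweeps should be close to $-1$. For very large values of the field the call to `math.exp` would overflow, so a production version would need a numerically safer form of the logistic function.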



Although the Gibbs sampler is, formally, a special case of the Metropolis- 
Hastings algorithm (or rather a combination of Metropolis-Hastings algo- 
rithms applied to different components; see Theorem 10.13), the Gibbs sam- 
pling algorithm has a number of distinct features: 

(i) The acceptance rate of the Gibbs sampler is uniformly equal to 1. Therefore, 
every simulated value is accepted and the suggestions of Section 7.6.1 
on the optimal acceptance rates do not apply in this setting. This also 
means that convergence assessment for this algorithm should be treated 
differently than for Metropolis-Hastings techniques. 

(ii) The use of the Gibbs sampler implies limitations on the choice of instrumental 
distributions and requires a prior knowledge of some analytical or 
probabilistic properties of $f$. 

(iii) The Gibbs sampler is, by construction, multidimensional. Even though 
some components of the simulated vector may be artificial for the problem 
of interest, or unnecessary for the required inference, the construction is 
still at least two-dimensional. 

(iv) The Gibbs sampler does not apply to problems where the number of pa- 
rameters varies, as in Chapter 11, because of the obvious lack of irreducibil- 
ity of the resulting chain. 



10.1.2 Completion 

Following the mixture method discussed in Sections 5.3.1 and 8.2, the Gibbs 
sampling algorithm can be generalized by a “demarginalization” or completion 
construction. 





Definition 10.3. Given a probability density $f$, a density $g$ that satisfies 

$$\int_{\mathcal{Z}} g(x,z)\,dz = f(x)$$

is called a completion of $f$. 

The density $g$ is chosen so that the full conditionals of $g$ are easy to simulate 
from and the Gibbs algorithm [A.39] is implemented on $g$ instead of $f$. For $p > 
1$, write $y = (x,z)$ and denote the conditional densities of $g(y) = g(y_1,\ldots,y_p)$ 
by 

$$Y_1\,|\,y_2,\ldots,y_p \sim g_1(y_1|y_2,\ldots,y_p),$$
$$Y_2\,|\,y_1,y_3,\ldots,y_p \sim g_2(y_2|y_1,y_3,\ldots,y_p),$$
$$\vdots$$
$$Y_p\,|\,y_1,\ldots,y_{p-1} \sim g_p(y_p|y_1,\ldots,y_{p-1}).$$

The move from $Y^{(t)}$ to $Y^{(t+1)}$ is then defined as follows. 

Algorithm A.40 -Completion Gibbs Sampler- 
Given $(y_1^{(t)},\ldots,y_p^{(t)})$, simulate 

1. $Y_1^{(t+1)} \sim g_1(y_1|y_2^{(t)},\ldots,y_p^{(t)})$, 
2. $Y_2^{(t+1)} \sim g_2(y_2|y_1^{(t+1)},y_3^{(t)},\ldots,y_p^{(t)})$,     [A.40] 
   ... 
p. $Y_p^{(t+1)} \sim g_p(y_p|y_1^{(t+1)},\ldots,y_{p-1}^{(t+1)})$. 


The two-stage Gibbs sampler of Chapter 9 obviously corresponds to the 
particular case of [A.40], where $f$ is completed in $g$ and $x$ in $y = (y_1,y_2)$ 
such that both conditional distributions $g_1(y_1|y_2)$ and $g_2(y_2|y_1)$ are available. 
(Again, note that both $y_1$ and $y_2$ can be either scalars or vectors.) 

In cases when such completions seem necessary (for instance, when every 
conditional distribution associated with $f$ is not explicit, or when $f$ is 
unidimensional), there exist an infinite number of densities for which $f$ is a marginal 
density. We will not discuss this choice in terms of optimality, because, first, 
there are few practical results on this topic (as it is similar to finding an 
optimal density $g$ in the Metropolis-Hastings algorithm) and, second, because 
there exists, in general, a natural completion of $f$ in $g$. As already pointed out 
in Chapter 9, missing data models (Section 5.3.1) provide a series of examples 
of natural completion. 

In principle, the Gibbs sampler does not require that the completion of 
$f$ into $g$ and of $x$ in $y = (x,z)$ should be related to the problem of interest. 
Indeed, there are settings where the vector $z$ has no meaning from a statistical 
point of view and is only a useful device, that is, an auxiliary variable as in 
(2.8). 
Example 10.4. Cauchy-Normal posterior distribution. As shown in 
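Chapter 2 (Section 2.2), Student's $t$ distribution can be generated as a mixture 
of a normal distribution by a chi squared distribution. This decomposition 
is useful when the expression $[1 + (\theta - \theta_0)^2]^{-\nu}$ appears in a more complex 
distribution. Consider, for instance, the density 

$$f(\theta|\theta_0) \propto \frac{e^{-\theta^2/2}}{[1 + (\theta-\theta_0)^2]^{\nu}}.$$

This is the posterior distribution resulting from the model 

$$X|\theta \sim \mathcal{N}(\theta,1) \quad\text{and}\quad \theta \sim \mathcal{C}(\theta_0,1)$$

(see also Example 12.12). A similar function arises in the estimation of the 
parameter of interest in a linear calibration model (see Example 1.16). 

The density $f(\theta|\theta_0)$ can be written as the marginal density 

$$f(\theta|\theta_0) \propto \int_0^{\infty} e^{-\theta^2/2}\,e^{-[1+(\theta-\theta_0)^2]\,\eta/2}\,\eta^{\nu-1}\,d\eta$$

and can, therefore, be completed as 

$$g(\theta,\eta) \propto e^{-\theta^2/2}\,e^{-[1+(\theta-\theta_0)^2]\,\eta/2}\,\eta^{\nu-1},$$

which leads to the conditional densities 

$$g_1(\eta|\theta) = \Big(\frac{1+(\theta-\theta_0)^2}{2}\Big)^{\nu}\,\frac{\eta^{\nu-1}}{\Gamma(\nu)}\,\exp\big\{-[1+(\theta-\theta_0)^2]\,\eta/2\big\},$$

$$g_2(\theta|\eta) = \sqrt{\frac{1+\eta}{2\pi}}\,\exp\Big\{-\Big(\theta - \frac{\eta\theta_0}{1+\eta}\Big)^2\,\frac{1+\eta}{2}\Big\},$$

that is, to a gamma and a normal distribution on $\eta$ and $\theta$, respectively: 

$$\eta|\theta \sim \mathcal{G}a\Big(\nu,\ \frac{1+(\theta-\theta_0)^2}{2}\Big), \qquad \theta|\eta \sim \mathcal{N}\Big(\frac{\theta_0\,\eta}{1+\eta},\ \frac{1}{1+\eta}\Big).$$

Note that the parameter $\eta$ is completely meaningless for the problem at hand 
but serves to facilitate computations. (See Problem 10.7 for further 
properties.) || 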
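The two conditionals of this completion translate directly into a two-step sampler. The Python sketch below alternates a gamma draw for the auxiliary variable $\eta$ and a normal draw for $\theta$; the function name, the starting value, and the choices $\theta_0 = 2$ and $\nu = 1$ in the run are hypothetical. Note that `random.gammavariate` is parameterized by shape and scale, so the rate $[1 + (\theta-\theta_0)^2]/2$ is inverted before the call.

```python
import math
import random


def cauchy_normal_gibbs(theta0, nu, n_iter, rng, theta_start=0.0):
    """Gibbs sampler on the completion
    g(theta, eta) proportional to
    exp(-theta^2/2) * exp(-[1 + (theta - theta0)^2] * eta / 2) * eta^(nu - 1).
    Returns the theta subchain; eta is only an auxiliary variable."""
    theta = theta_start
    draws = []
    for _ in range(n_iter):
        rate = 0.5 * (1.0 + (theta - theta0) ** 2)
        # eta | theta ~ Gamma(shape nu, rate): gammavariate expects the scale.
        eta = rng.gammavariate(nu, 1.0 / rate)
        # theta | eta ~ N(theta0 * eta / (1 + eta), 1 / (1 + eta)).
        theta = rng.gauss(theta0 * eta / (1.0 + eta),
                          math.sqrt(1.0 / (1.0 + eta)))
        draws.append(theta)
    return draws


rng = random.Random(0)
draws = cauchy_normal_gibbs(theta0=2.0, nu=1.0, n_iter=20000, rng=rng)
posterior_mean = sum(draws) / len(draws)
print(posterior_mean)
```

Because the target multiplies a standard normal density by a factor peaked at $\theta_0$, the estimated mean should lie strictly between 0 and $\theta_0$, which offers a simple check that the two conditionals are coded consistently.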



The Gibbs sampler of [A.40] has also been called the Gibbs sampler with 
systematic scan (or systematic sweep), as the path of iteration is to proceed 
systematically in one direction. Such a sampler results in a non-reversible 
Markov chain, but we can construct a reversible Gibbs sampler with symmetric 
scan. The following algorithm guarantees the reversibility of the chain $(Y^{(t)})$ 
(see Problem 10.12). 

Algorithm A.41 -Reversible Gibbs Sampler- 

Given $(y_1^{(t)},\ldots,y_p^{(t)})$, generate 

1.    $Y_1^* \sim g_1(y_1|y_2^{(t)},\ldots,y_p^{(t)})$ 
2.    $Y_2^* \sim g_2(y_2|y_1^*,y_3^{(t)},\ldots,y_p^{(t)})$ 
      ... 
p-1.  $Y_{p-1}^* \sim g_{p-1}(y_{p-1}|y_1^*,\ldots,y_{p-2}^*,y_p^{(t)})$ 
p.    $Y_p^{(t+1)} \sim g_p(y_p|y_1^*,\ldots,y_{p-1}^*)$ 
p+1.  $Y_{p-1}^{(t+1)} \sim g_{p-1}(y_{p-1}|y_1^*,\ldots,y_{p-2}^*,y_p^{(t+1)})$     [A.41] 
      ... 
2p-1. $Y_1^{(t+1)} \sim g_1(y_1|y_2^{(t+1)},\ldots,y_p^{(t+1)})$ 




An^ alternative to [A.41] has been proposed by Liu et al. (1995) and is 
called Gibbs sampling with random scan, as the simulation of the components 
of $y$ is done in a random order following each transition. For the setup of 
[A.40], the modification [A.42] produces a reversible chain with stationary 
distribution $g$ and every simulated value is used. 

Algorithm A.42 -Random Sweep Gibbs Sampler- 

1.   Generate a permutation $\sigma \in \mathfrak{S}_p$; 
2.   Simulate $Y_{\sigma_1}^{(t+1)} \sim g_{\sigma_1}(y_{\sigma_1}|y_j^{(t)},\ j \ne \sigma_1)$;     [A.42] 
     ... 
p+1. Simulate $Y_{\sigma_p}^{(t+1)} \sim g_{\sigma_p}(y_{\sigma_p}|y_j^{(t+1)},\ j \ne \sigma_p)$. 

This algorithm improves upon [A.41], which only uses one simulation out 
of two.^ (Recall that, following Theorem 6.65, the reversibility of $(Y^{(t)})$ allows 
application of the Central Limit Theorem.) 
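The three scanning strategies differ only in the order in which the components are visited, which the following Python sketch makes explicit. The helper names `scan_order`, `gibbs_transition`, and `make_autoexponential_samplers` are hypothetical, and the full conditionals used for the run are those of the autoexponential model of Example 10.1 with all $\theta_{ij}$ set to 1, an arbitrary choice for illustration.

```python
import random


def scan_order(p, scheme, rng):
    """Order of component updates within one transition (0-based indices)."""
    if scheme == "systematic":          # as in [A.40]: 1, 2, ..., p
        return list(range(p))
    if scheme == "symmetric":           # as in [A.41]: 1, ..., p, p-1, ..., 1
        return list(range(p)) + list(range(p - 2, -1, -1))
    if scheme == "random":              # as in [A.42]: a random permutation
        order = list(range(p))
        rng.shuffle(order)
        return order
    raise ValueError("unknown scheme: " + scheme)


def gibbs_transition(state, conditional_samplers, order, rng):
    """Update the components in the given order, each from its full
    conditional given the current values of all the other components."""
    state = list(state)
    for i in order:
        state[i] = conditional_samplers[i](state, rng)
    return state


def make_autoexponential_samplers(theta12, theta23, theta31):
    """Exponential full conditionals of the autoexponential model."""
    return [
        lambda y, rng: rng.expovariate(1.0 + theta12 * y[1] + theta31 * y[2]),
        lambda y, rng: rng.expovariate(1.0 + theta12 * y[0] + theta23 * y[2]),
        lambda y, rng: rng.expovariate(1.0 + theta23 * y[1] + theta31 * y[0]),
    ]


rng = random.Random(0)
samplers = make_autoexponential_samplers(1.0, 1.0, 1.0)
state = [1.0, 1.0, 1.0]
for _ in range(1000):
    state = gibbs_transition(state, samplers, scan_order(3, "random", rng), rng)
print(state)
```

Updating the state in place reproduces the symmetric scan faithfully: the intermediate values $Y_1^*,\ldots,Y_{p-1}^*$ of the forward pass are exactly the ones conditioned upon during the backward pass before being overwritten.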

10.1.3 The General Hammersley-Clifford Theorem 

We have already seen the Hammersley-Clifford Theorem in the special case 
of the two-stage Gibbs sampler (Section 9.1.4). The theorem also holds in the 
general case and follows from similar manipulations of the conditional 
distributions^ (Hammersley and Clifford 1970; see also Besag 1974 and Gelman 
and Speed 1993). 

^ Step p is only called once, since repeating it would result in a complete waste of 
time! 

^ Obviously, nothing prevents the use of all the simulations in integral 
approximations. 

To extend Theorem 9.3 to an arbitrary $p$, we need the condition of positivity 
(see Definition 9.4). We then have the following result, which specifies 
the joint distribution up to a proportionality constant, as in the two-stage 
Theorem 9.3. 



Theorem 10.5. Hammersley-Clifford. Under the positivity condition, the 
joint distribution $g$ satisfies 

$$g(y_1,\ldots,y_p) \propto \prod_{j=1}^{p}\frac{g_{\ell_j}(y_{\ell_j}|y_{\ell_1},\ldots,y_{\ell_{j-1}},y'_{\ell_{j+1}},\ldots,y'_{\ell_p})}{g_{\ell_j}(y'_{\ell_j}|y_{\ell_1},\ldots,y_{\ell_{j-1}},y'_{\ell_{j+1}},\ldots,y'_{\ell_p})}$$

for every permutation $\ell$ on $\{1, 2, \ldots, p\}$ and every $y' \in \mathcal{Y}$. 

Proof. For a given $y' \in \mathcal{Y}$, 

$$g(y_1,\ldots,y_p) = g_p(y_p|y_1,\ldots,y_{p-1})\,g^p(y_1,\ldots,y_{p-1})$$
$$= \frac{g_p(y_p|y_1,\ldots,y_{p-1})}{g_p(y'_p|y_1,\ldots,y_{p-1})}\,g(y_1,\ldots,y_{p-1},y'_p)$$
$$= \frac{g_p(y_p|y_1,\ldots,y_{p-1})}{g_p(y'_p|y_1,\ldots,y_{p-1})}\,\frac{g_{p-1}(y_{p-1}|y_1,\ldots,y_{p-2},y'_p)}{g_{p-1}(y'_{p-1}|y_1,\ldots,y_{p-2},y'_p)} \times g(y_1,\ldots,y'_{p-1},y'_p).$$

A recursion argument then shows that 

$$g(y_1,\ldots,y_p) = \prod_{j=1}^{p}\frac{g_j(y_j|y_1,\ldots,y_{j-1},y'_{j+1},\ldots,y'_p)}{g_j(y'_j|y_1,\ldots,y_{j-1},y'_{j+1},\ldots,y'_p)}\ g(y'_1,\ldots,y'_p).$$

The proof is identical for an arbitrary permutation $\ell$. □ 

The extension of Theorem 10.5 to the non-positive case is more delicate 
and requires additional assumptions, as shown by Example 10.7. Besag (1994) 
proposes a formal generalization which is not always relevant in the setup 
of Gibbs sampling algorithms. Robert et al. (1997) modify Besag's (1994) 
condition to preserve the convergence properties of these algorithms and show, 
moreover, that the connectedness of the support of $g$ is essential for [A.40] to 
converge under every regular parameterization of the model. (See Problems 
10.5 and 10.8.) 

^ Clifford and Hammersley never published their result. Hammersley (1974) justifies 
their decision by citing the impossibility of extending this result to the 
non-positive case, as shown by Moussouris (1974) through a counterexample. 

This result is also interesting from the general point of view of MCMC 
algorithms, since it shows that the density $g$ is known up to a multiplicative 
constant when the conditional densities $g_i(y_i|y_{j\ne i})$ are available. It is therefore 
possible to compare the Gibbs sampler with alternatives like Accept-Reject 
or Metropolis-Hastings algorithms. This also implies that the Gibbs sampler 
is never the single available method to simulate from $g$ and that it is always 
possible to include Metropolis-Hastings steps in a Gibbs sampling algorithm, 
following a hybrid strategy developed in Section 10.3. 
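The fact that the full conditionals determine $g$ up to a constant can be checked numerically on a small discrete example. The Python sketch below builds a strictly positive joint distribution on $\{0,1\}^3$ (the weights are hypothetical), computes the product of conditional ratios of Theorem 10.5 for the identity permutation, and compares it with $g(y)/g(y')$. The function names are illustrative.

```python
import itertools


def conditional(joint, j, value, context):
    """g_j(value | the other coordinates of context), computed from a
    joint probability table indexed by tuples in {0,1}^p."""
    point = list(context)
    point[j] = value
    numerator = joint[tuple(point)]
    denominator = 0.0
    for v in (0, 1):
        point[j] = v
        denominator += joint[tuple(point)]
    return numerator / denominator


def hammersley_clifford_ratio(joint, y, y_ref):
    """Product over j of
    g_j(y_j | y_1..y_{j-1}, y'_{j+1}..y'_p) / g_j(y'_j | same conditioning)."""
    p = len(y)
    product = 1.0
    for j in range(p):
        # Coordinates before j come from y, those after j from y_ref.
        context = list(y[:j]) + [y_ref[j]] + list(y_ref[j + 1:])
        product *= (conditional(joint, j, y[j], context)
                    / conditional(joint, j, y_ref[j], context))
    return product


# Hypothetical strictly positive joint distribution on {0,1}^3.
weights = {pt: 1.0 + pt[0] + 2.0 * pt[1] * pt[2] + 0.5 * pt[0] * pt[2]
           for pt in itertools.product((0, 1), repeat=3)}
total = sum(weights.values())
joint = {pt: w / total for pt, w in weights.items()}

y, y_ref = (1, 0, 1), (0, 1, 0)
ratio = hammersley_clifford_ratio(joint, y, y_ref)
print(ratio, joint[y] / joint[y_ref])
```

The two printed values agree because the product telescopes, each factor being $g(y_1,\ldots,y_j,y'_{j+1},\ldots,y'_p)/g(y_1,\ldots,y_{j-1},y'_j,\ldots,y'_p)$. Positivity matters here: if some intermediate point had probability zero, one of the denominators would vanish.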



10.2 Theoretical Justifications 

10.2.1 Markov Properties of the Gibbs Sampler 

In this section we investigate the convergence properties of the Gibbs sampler 
in the multi-stage case. The first thing we show is that most Gibbs samplers 
satisfy the minimal conditions necessary for ergodicity. We have mentioned 
that Gibbs sampling is a special case of the Metropolis-Hastings algorithm 
(this notion is formalized in Theorem 10.13). Unfortunately, however, this fact 
is not much help to us in the current endeavor. 

We will work with the general Gibbs sampler [A.40] and show that the 
Markov chain $(Y^{(t)})$ converges to the distribution $g$ and the subchain $(X^{(t)})$ 
converges to the distribution $f$. It is important to note that although $(Y^{(t)})$ 
is, by construction, a Markov chain, the subchain $(X^{(t)})$ is, typically, not a 
Markov chain, except in the particular case of two-stage Gibbs (see Section 
9.2). 

Theorem 10.6. For the Gibbs sampler of [A.40], if $(Y^{(t)})$ is ergodic, then 
the distribution $g$ is a stationary distribution for the chain $(Y^{(t)})$ and $f$ is the 
limiting distribution of the subchain $(X^{(t)})$. 

Proof. The kernel of the chain $(Y^{(t)})$ is the product

$$K(y,y') = g_1(y_1'|y_2,\ldots,y_p) \times g_2(y_2'|y_1',y_3,\ldots,y_p) \cdots g_p(y_p'|y_1',\ldots,y_{p-1}'). \qquad (10.2)$$

For the vector $y = (y_1, y_2, \ldots, y_p)$, let $g^i(y_1,\ldots,y_{i-1},y_{i+1},\ldots,y_p)$ denote the
marginal density of the vector $y$ with $y_i$ integrated out. If $Y \sim g$ and $A$ is
measurable under the dominating measure, then

$$\begin{aligned}
P(Y' \in A) &= \int 1_A(y')\, K(y,y')\, g(y)\, dy'\, dy \\
&= \int 1_A(y')\, \left[ g_1(y_1'|y_2,\ldots,y_p) \cdots g_p(y_p'|y_1',\ldots,y_{p-1}') \right] \\
&\qquad \times \left[ g_1(y_1|y_2,\ldots,y_p)\, g^1(y_2,\ldots,y_p) \right] dy_1 \cdots dy_p\, dy_1' \cdots dy_p' \\
&= \int 1_A(y')\, g_2(y_2'|y_1',y_3,\ldots,y_p) \cdots g_p(y_p'|y_1',\ldots,y_{p-1}') \\
&\qquad \times g(y_1',y_2,\ldots,y_p)\, dy_2 \cdots dy_p\, dy_1' \cdots dy_p',
\end{aligned}$$

where we have integrated out $y_1$ and combined
$g_1(y_1'|y_2,\ldots,y_p)\, g^1(y_2,\ldots,y_p) = g(y_1',y_2,\ldots,y_p)$. Next, write
$g(y_1',y_2,\ldots,y_p) = g_2(y_2|y_1',y_3,\ldots,y_p)\, g^2(y_1',y_3,\ldots,y_p)$, and integrate out
$y_2$ to obtain

$$\begin{aligned}
P(Y' \in A) &= \int 1_A(y')\, g_3(y_3'|y_1',y_2',y_4,\ldots,y_p) \cdots g_p(y_p'|y_1',\ldots,y_{p-1}') \\
&\qquad \times g(y_1',y_2',y_3,\ldots,y_p)\, dy_3 \cdots dy_p\, dy_1' \cdots dy_p'.
\end{aligned}$$

By continuing in this fashion and successively integrating out the $y_i$'s, the
above probability is

$$P(Y' \in A) = \int_A g(y_1',\ldots,y_p')\, dy',$$

showing that $g$ is the stationary distribution. Therefore, Theorem 6.50 implies
that $Y^{(t)}$ is asymptotically distributed according to $g$ and $X^{(t)}$ is
asymptotically distributed according to the marginal $f$, by integration. □

A shorter proof can be based on the stationarity of $g$ after each step in the
$p$ steps of [A.40] (see Theorem 10.13).
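The invariance established in this proof can be checked numerically on a small discrete example. The sketch below is only an illustration under stated assumptions: the joint table `G` on a 2 x 3 grid is hypothetical, and the kernel is the two-component version of (10.2), with the first coordinate updated given the old second coordinate and the second coordinate updated given the new first one.

```python
from itertools import product

# Hypothetical joint probabilities g(a, b) on {0, 1} x {0, 1, 2}; they sum to 1.
G = [[0.10, 0.05, 0.15],
     [0.20, 0.30, 0.20]]
STATES = list(product(range(2), range(3)))


def cond_first(a, b):
    """Full conditional g_1(a | b) of the first coordinate."""
    return G[a][b] / sum(G[r][b] for r in range(2))


def cond_second(b, a):
    """Full conditional g_2(b | a) of the second coordinate."""
    return G[a][b] / sum(G[a])


def kernel(y, y_new):
    """Two-component Gibbs kernel: update a given b, then b given the new a."""
    (_, b), (a_new, b_new) = y, y_new
    return cond_first(a_new, b) * cond_second(b_new, a_new)


def push_forward(dist):
    """Distribution of Y' when Y ~ dist and Y' | Y follows the Gibbs kernel."""
    return {y_new: sum(dist[y] * kernel(y, y_new) for y in STATES)
            for y_new in STATES}


g = {(a, b): G[a][b] for a, b in STATES}
g_next = push_forward(g)
# Largest absolute change after one Gibbs transition; it should be at rounding level.
max_change = max(abs(g_next[s] - g[s]) for s in STATES)
```

Starting `push_forward` from a distribution other than `g` and iterating gives a quick way to watch the convergence that the theorem describes.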

We now consider the irreducibility of the Markov chain produced by the 
Gibbs sampler. The following example shows that a Gibbs sampler need not 
be irreducible: 

Example 10.7. Nonconnected support. Let $\mathcal{E}$ and $\mathcal{E}'$ denote the disks
of $\mathbb{R}^2$ with radius 1 and respective centers (1,1) and (-1,-1). Consider the
distribution with density

$$f(x_1,x_2) = \frac{1}{2\pi} \left\{ 1_{\mathcal{E}}(x_1,x_2) + 1_{\mathcal{E}'}(x_1,x_2) \right\}, \qquad (10.3)$$

which is shown in Figure 10.1. From this density we cannot produce an
irreducible chain through [A.40], since the resulting chain remains concentrated
on the (positive or negative) quadrant on which it is initialized. (Note that a
change of coordinates such as $z_1 = x_1 + x_2$ and $z_2 = x_1 - x_2$ is sufficient to
remove this difficulty.) ||
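The trapping described in this example, and the effect of the change of coordinates, can be illustrated by simulation. The following is a minimal sketch, assuming the uniform density on the two unit disks; the function names, starting points, and iteration counts are illustrative choices. In the rotated coordinates the disks become $(z_1 \mp 2)^2 + z_2^2 \le 2$, so a line of constant $z_2$ cuts both disks in chords of equal length.

```python
import math
import random


def draw_coordinate(other, rng):
    """Draw x1 given x2 = other (or x2 given x1, by symmetry)."""
    # A horizontal or vertical line meets only one of the two disks, so the
    # full conditional is uniform on a single chord of that disk.
    center = 1.0 if other > 0 else -1.0
    half = math.sqrt(max(0.0, 1.0 - (other - center) ** 2))
    return rng.uniform(center - half, center + half)


def gibbs_original(n_iter, rng):
    """Fraction of iterations spent in the disk centered at (1, 1)."""
    x1, x2 = 1.0, 1.0  # start inside the positive-quadrant disk
    in_positive_disk = 0
    for _ in range(n_iter):
        x1 = draw_coordinate(x2, rng)
        x2 = draw_coordinate(x1, rng)
        in_positive_disk += x1 > 0
    return in_positive_disk / n_iter


def gibbs_rotated(n_iter, rng):
    """Same target in coordinates z1 = x1 + x2, z2 = x1 - x2."""
    z1, z2 = 2.0, 0.0  # image of the starting point (1, 1)
    in_positive_disk = 0
    for _ in range(n_iter):
        # z1 | z2: two chords of equal length, centered at +2 and -2.
        half = math.sqrt(max(0.0, 2.0 - z2 ** 2))
        center = rng.choice((2.0, -2.0))
        z1 = rng.uniform(center - half, center + half)
        # z2 | z1: a single chord centered at 0 in the disk that contains z1.
        center = 2.0 if z1 > 0 else -2.0
        half = math.sqrt(max(0.0, 2.0 - (z1 - center) ** 2))
        z2 = rng.uniform(-half, half)
        in_positive_disk += z1 > 0
    return in_positive_disk / n_iter


trapped = gibbs_original(2000, random.Random(0))
mixing = gibbs_rotated(5000, random.Random(0))
```

In the original coordinates the chain never leaves its starting disk, whereas in the rotated coordinates each update of $z_1$ picks either disk with probability 1/2.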

Recall the positivity condition given in Definition 9.4, and the consequence
that two arbitrary Borel subsets of the support can be joined in a single
iteration of [A.40]. We therefore have the following theorem, whose proof is
left to Problem 10.11.

Theorem 10.8. For the Gibbs sampler of [A.40], if the density $g$ satisfies the
positivity condition, it is irreducible.

Unfortunately, the condition of Theorem 10.8 is often difficult to verify,
and Tierney (1994) gives a more manageable condition which we will use
instead.

Fig. 10.1. Support of the function $f(x_1,x_2)$ of (10.3).



Lemma 10.9. If the transition kernel associated with [A.40] is absolutely
continuous with respect to the dominating measure, the resulting chain is Harris
recurrent.

The proof of Lemma 10.9 is similar to that of Lemma 7.3, where Harris
recurrence was shown to follow from irreducibility. Here, the condition on the
Gibbs transition kernel yields an irreducible chain, and Harris recurrence
follows. Once the irreducibility of the chain associated with [A.40] is
established, more advanced convergence properties can be established.

This condition of absolute continuity on the kernel (10.2) is satisfied by
most decompositions. However, if one of the steps $i$ ($1 \le i \le p$) is replaced
by a simulation from a Metropolis-Hastings algorithm, as in the hybrid
algorithms described in Section 10.3, absolute continuity is lost and it will be
necessary either to study the recurrence properties of the chain or to introduce
an additional simulation step to guarantee Harris recurrence.

From the property of Harris recurrence, we can now establish a result
similar to Theorem 7.4.

Theorem 10.10. If the transition kernel of the Gibbs chain $(Y^{(t)})$ is
absolutely continuous with respect to the measure $\mu$,

(i) If $h_1, h_2 \in L^1(g)$ with $\int h_2(y)\, dg(y) \neq 0$, then

$$\lim_{n\to\infty} \frac{\sum_{t=1}^{n} h_1(Y^{(t)})}{\sum_{t=1}^{n} h_2(Y^{(t)})} = \frac{\int h_1(y)\, dg(y)}{\int h_2(y)\, dg(y)}.$$

(ii) If, in addition, $(Y^{(t)})$ is aperiodic, then, for every initial distribution $\mu$,

$$\lim_{n\to\infty} \left\| \int K^n(y,\cdot)\, \mu(dy) - g \right\|_{TV} = 0.$$

We also state a more general condition, which follows from Lemma 7.6 by
allowing for moves in a minimal neighborhood.

Lemma 10.11. Let $y = (y_1,\ldots,y_p)$ and $y' = (y_1',\ldots,y_p')$ and suppose there
exists $\delta > 0$ for which $y, y' \in \mathrm{supp}(g)$, $|y - y'| < \delta$ and

$$g_i(y_i|y_1,\ldots,y_{i-1},y_{i+1}',\ldots,y_p') > 0, \qquad i = 1,\ldots,p.$$

If there exists $\delta' < \delta$ for which almost every pair $(y,y') \in \mathrm{supp}(g)$ can be
connected by a finite sequence of balls with radius $\delta'$ having the (stationary)
measure of the intersection of two consecutive balls positive, then the chain
produced by [A.40] is irreducible and aperiodic.

The laborious formulation of this condition is needed to accommodate
settings where the support of $g$ is not connected, as in Example 10.7, since a
necessary irreducibility condition is that two connected components of $\mathrm{supp}(g)$
can be linked by the kernel of [A.40]. The proof of Lemma 10.11 follows from
arguments similar to those in the proof of Lemma 7.6.

Corollary 10.12. The conclusions of Theorem 10.10 hold if the Gibbs
sampling Markov chain $(Y^{(t)})$ has conditional densities
$g_i(y_i|y_1,\ldots,y_{i-1},y_{i+1}',\ldots,y_p')$ that satisfy the assumptions of Lemma 10.11.



10.2.2 Gibbs Sampling as Metropolis-Hastings

We now examine the exact relationship between the Gibbs sampling and 
Metropolis-Hastings algorithms by considering [A.40] as the composition of p 
Markovian kernels. (As mentioned earlier, this representation is not sufficient 
to establish irreducibility since each of these separate Metropolis-Hastings 
algorithms does not produce an irreducible chain, given that it only modifies 
one component of y. In fact, these kernels are never irreducible, since they are 
constrained to subspaces of lower dimensions.) 

Theorem 10.13. The Gibbs sampling method of [A.40] is equivalent to the 
composition of p Metropolis-Hastings algorithms, with acceptance probabilities 
uniformly equal to 1. 

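Proof. If we write [A.40] as the composition of $p$ "elementary" algorithms
which correspond to the $p$ simulation steps from the conditional distribution,
it is sufficient to show that each of these algorithms has an acceptance
probability equal to 1. For $1 \le i \le p$, the instrumental distribution in step $i$ of
[A.40] is given by

$$q_i(y'|y) = \delta_{(y_1,\ldots,y_{i-1},y_{i+1},\ldots,y_p)}(y_1',\ldots,y_{i-1}',y_{i+1}',\ldots,y_p') \times g_i(y_i'|y_1,\ldots,y_{i-1},y_{i+1},\ldots,y_p)$$

and the ratio defining the probability $\rho(y,y')$ is therefore

$$\begin{aligned}
\frac{g(y')\, q_i(y|y')}{g(y)\, q_i(y'|y)}
&= \frac{g(y')\, g_i(y_i|y_1,\ldots,y_{i-1},y_{i+1},\ldots,y_p)}{g(y)\, g_i(y_i'|y_1,\ldots,y_{i-1},y_{i+1},\ldots,y_p)} \\
&= \frac{g_i(y_i'|y_1,\ldots,y_{i-1},y_{i+1},\ldots,y_p)\, g_i(y_i|y_1,\ldots,y_{i-1},y_{i+1},\ldots,y_p)}{g_i(y_i|y_1,\ldots,y_{i-1},y_{i+1},\ldots,y_p)\, g_i(y_i'|y_1,\ldots,y_{i-1},y_{i+1},\ldots,y_p)} \\
&= 1,
\end{aligned}$$

where the second equality uses the fact that $y$ and $y'$ differ only in their
$i$th component, so that the marginal density of the other components cancels. □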

Note that Gibbs sampling is not the only MCMC algorithm to enjoy this
property. As noted in Section 7.2, every kernel $q$ associated with a reversible
Markov chain with invariant distribution $g$ has an acceptance probability
uniformly equal to 1 (see also Barone and Frigessi 1989 and Liu and Sabatti
1999). However, the (global) acceptance probability for the vector $(y_1,\ldots,y_p)$
is usually different from 1 and a direct processing of [A.40] as a particular
Metropolis-Hastings algorithm leads to a positive probability of rejection.

Example 10.14. (Continuation of Example 10.1) Consider the
two-dimensional autoexponential model

$$g(y_1,y_2) \propto \exp\{-y_1 - y_2 - \theta_{12}\, y_1 y_2\}.$$

The kernel associated with [A.40] is then composed of the conditional densities

$$K((y_1,y_2),(y_1',y_2')) = g_1(y_1'|y_2)\, g_2(y_2'|y_1'),$$

with

$$g_1(y_1|y_2) = (1 + \theta_{12} y_2) \exp\{-(1 + \theta_{12} y_2)\, y_1\},$$

$$g_2(y_2|y_1) = (1 + \theta_{12} y_1) \exp\{-(1 + \theta_{12} y_1)\, y_2\}.$$

The ratio

$$\frac{g(y_1',y_2')\, K((y_1',y_2'),(y_1,y_2))}{g(y_1,y_2)\, K((y_1,y_2),(y_1',y_2'))} = \frac{(1 + \theta_{12} y_2')(1 + \theta_{12} y_1)}{(1 + \theta_{12} y_2)(1 + \theta_{12} y_1')}\, \exp\{\theta_{12}(y_1' y_2 - y_1 y_2')\}$$

is thus different from 1 for almost every vector $(y_1, y_2, y_1', y_2')$. ||
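The global ratio of this example can be evaluated numerically from the definitions of $g$, $g_1$, and $g_2$ alone. The sketch below is an illustration only: the value of $\theta_{12}$ and the two points are arbitrary choices, and `closed_form` is the expression obtained by expanding the ratio $g(y')K(y',y) / \{g(y)K(y,y')\}$ by hand.

```python
import math


def g(y1, y2, theta):
    """Unnormalized autoexponential density."""
    return math.exp(-y1 - y2 - theta * y1 * y2)


def cond(u, v, theta):
    """Exponential full conditional of one coordinate u given the other, v."""
    rate = 1.0 + theta * v
    return rate * math.exp(-rate * u)


def kernel(y, y_new, theta):
    """Gibbs kernel: first coordinate given old second, then second given new first."""
    (_, y2), (n1, n2) = y, y_new
    return cond(n1, y2, theta) * cond(n2, n1, theta)


def global_ratio(y, y_new, theta):
    """Metropolis-Hastings ratio when the whole Gibbs sweep is used as proposal."""
    return (g(*y_new, theta) * kernel(y_new, y, theta)) / (
        g(*y, theta) * kernel(y, y_new, theta))


def closed_form(y, y_new, theta):
    """Hand-expanded version of the same ratio."""
    (y1, y2), (n1, n2) = y, y_new
    prefactor = ((1 + theta * n2) * (1 + theta * y1)) / (
        (1 + theta * y2) * (1 + theta * n1))
    return prefactor * math.exp(theta * (n1 * y2 - y1 * n2))


THETA = 0.7                      # arbitrary illustrative value
Y, Y_NEW = (0.5, 1.2), (2.0, 0.3)
ratio = global_ratio(Y, Y_NEW, THETA)
```

The normalizing constant of $g$ cancels in the ratio, which is why the unnormalized density is sufficient here.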

In this global analysis of [A.40], it is possible to reject some of the values
produced by the sequence of steps $1,\ldots,p$ of this algorithm. This version of
the Metropolis-Hastings algorithm could then be compared with the original
Gibbs sampler. No full-scale study has yet been undertaken in this direction,
except for the modification introduced by Liu (1995, 1996a) and presented
in Section 10.3. The fact that the first approach allows for rejections seems
beneficial when considering the results of Gelman et al. (1996), but the Gibbs
sampling algorithm cannot be evaluated in this way.



10.2.3 Hierarchical Structures

To conclude this section, we investigate a structure for which Gibbs sampling
is particularly well adapted, that of hierarchical models. These are structures
in which the distribution $f$ can be decomposed as ($I \ge 1$)

$$f(x) = \int f_1(x|z_1)\, f_2(z_1|z_2) \cdots f_I(z_{I-1}|z_I)\, f_{I+1}(z_I)\, dz_1 \cdots dz_I,$$

for either structural or computational reasons. Such models naturally appear
in the Bayesian analysis of complex models, where the diversity of the prior
information or the variability of the observations may require the introduction
of several levels of prior distributions (see Wakefield et al. 1994, Robert 2001,
Chapter 10, Bennett et al. 1996, Spiegelhalter et al. 1996, and Draper 1998).
This is, for instance, the case of Example 10.17, where the first level
represents the exchangeability of the parameters $\lambda_i$, and the second level
represents the prior information. In the particular case where the prior
information is sparse, hierarchical modeling is also useful since diffuse (or
noninformative) distributions can be introduced at various levels of the hierarchy.

The following two examples illustrate situations where hierarchical models 
are particularly useful (see also Problems 10.29-10.36). 

Example 10.15. Animal epidemiology. Research in animal epidemiology 
sometimes uses data from groups of animals, such as litters or herds. Such data 
may not follow some of the usual assumptions of independence, etc., and, as a 
result, variances of parameter estimates tend to be larger (this phenomenon is 
often referred to as “overdispersion”). Schukken et al. (1991) obtained counts 
of the number of cases of clinical mastitis^ in dairy cattle herds over a one 
year period. 

If we assume that, in each herd, the occurrence of mastitis is a Bernoulli
random variable, and if we let $X_i$, $i = 1,\ldots,m$, denote the number of cases in
herd $i$, it is then reasonable to model $X_i \sim \mathcal{P}(\lambda_i)$, where $\lambda_i$ is the underlying
rate of infection in herd $i$. However, there is lack of independence here (mastitis
is infectious), which might manifest itself as overdispersion. To account for
this, Schukken et al. (1991) put a gamma prior distribution on the Poisson
parameter. A complete hierarchical specification is

$$X_i \sim \mathcal{P}(\lambda_i),$$

$$\lambda_i \sim \mathcal{G}a(\alpha, \beta_i),$$

$$\beta_i \sim \mathcal{G}a(a, b),$$

where $\alpha$, $a$, and $b$ are specified. The posterior density of $\lambda_i$, $\pi(\lambda_i|x,\alpha)$, can
now be simulated via the Gibbs sampler

$$\lambda_i \sim \pi(\lambda_i|x,\alpha,\beta_i) = \mathcal{G}a(x_i + \alpha, 1 + \beta_i),$$

$$\beta_i \sim \pi(\beta_i|x,\alpha,a,b,\lambda_i) = \mathcal{G}a(\alpha + a, \lambda_i + b).$$

Figure 10.2 shows selected estimates of $\lambda_i$ and $\beta_i$. For more details see Problem
10.10 or Eberly and Casella (2003). ||

^ Mastitis is an inflammation usually caused by infection.
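The two full conditionals of this example translate directly into a two-stage Gibbs sampler. The sketch below is an illustration under stated assumptions: the counts `X` and the values of `ALPHA`, `A`, and `B` are hypothetical and are not the mastitis data of Schukken et al. (1991). Note that Python's `random.gammavariate` takes a shape and a scale, so the rate parameters of the conditionals are inverted.

```python
import random

X = [0, 3, 10]                   # hypothetical counts for three herds
ALPHA, A, B = 1.0, 1.0, 1.0      # hypothetical hyperparameter values


def gibbs_poisson_gamma(x, alpha, a, b, n_iter, rng):
    """Return the running means of lambda_i over n_iter Gibbs iterations."""
    m = len(x)
    beta = [1.0] * m             # arbitrary starting values
    lam_sums = [0.0] * m
    for _ in range(n_iter):
        # lambda_i | x, alpha, beta_i ~ Ga(x_i + alpha, 1 + beta_i)  (rate form)
        lam = [rng.gammavariate(x[i] + alpha, 1.0 / (1.0 + beta[i]))
               for i in range(m)]
        # beta_i | x, alpha, a, b, lambda_i ~ Ga(alpha + a, lambda_i + b)
        beta = [rng.gammavariate(alpha + a, 1.0 / (lam[i] + b))
                for i in range(m)]
        for i in range(m):
            lam_sums[i] += lam[i]
    return [s / n_iter for s in lam_sums]


lambda_means = gibbs_poisson_gamma(X, ALPHA, A, B, 5000, random.Random(0))
```

Since the herds are independent given the hyperparameters, each pair $(\lambda_i, \beta_i)$ forms its own two-stage chain; the loop over herds is only a convenience.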





Fig. 10.2. Histograms, density estimates and running mean estimates for three
selected parameters of the mastitis data (Example 10.15).



Example 10.16. Medical models. Part of the concern of the study of
pharmacokinetics is the modeling of the relationship between the dosage of
a drug and the resulting concentration in the blood. (More generally,
pharmacokinetics studies the different interactions of a drug and the body.) Gilks
et al. (1993) introduce an approach for estimating pharmacokinetic parameters
that uses the traditional mixed-effects model and nonlinear structure,
but which is also robust to the outliers common to clinical trials. For a given
dose $d_i$ administered at time 0 to patient $i$, the measured log concentration
in the blood at time $t_{ij}$, $X_{ij}$, is assumed to follow a Student's $t$ distribution

$$\frac{X_{ij} - \log g_{ij}(\lambda_i)}{\sigma \sqrt{n/(n-2)}} \sim \mathcal{T}_n,$$

where $\lambda_i = (\log C_i, \log V_i)'$ is a vector of parameters for the $i$th individual,
$\sigma^2$ is the measurement error variance, and $g_{ij}$ is given by

$$g_{ij}(\lambda_i) = \frac{d_i}{V_i} \exp\left(-\frac{C_i t_{ij}}{V_i}\right).$$

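(The parameter $C_i$ represents clearance and $V_i$ represents volume for patient
$i$.) Gilks et al. (1993) then complete the hierarchy by specifying a
noninformative prior on $\sigma$, $\pi(\sigma) = 1/\sigma$ and

$$\lambda_i \sim \mathcal{N}(\theta, \Sigma),$$

$$\theta \sim \mathcal{N}(\tau_1, T_1), \quad \text{and} \quad \Sigma^{-1} \sim \mathcal{W}_2(\tau_2, T_2),$$

where the values of $\tau_1$, $T_1$, $\tau_2$, and $T_2$ are specified. Conjugate structures can
then be exhibited for most parameters by using Dickey's decomposition
of the Student's $t$ distribution (see Example 4.5); that is, by associating to
each $x_{ij}$ an (artificial) variable $\omega_{ij}$ such that

$$X_{ij} | \omega_{ij} \sim \mathcal{N}\left( \log g_{ij}(\lambda_i),\ \omega_{ij}\, \sigma^2\, \frac{n}{n-2} \right), \qquad \omega_{ij}^{-1} \sim \chi^2_n / n.$$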



Using this completion, the full conditional distributions on the $C_i$'s and $\theta$
are normal, while the full conditional distributions on $\sigma^2$ and $\Sigma$ are inverse
gamma and inverse Wishart, respectively. The case of the $V_i$'s is more difficult
to handle since the full conditional density is proportional to

$$\exp\left( -\frac{1}{2} \left\{ \sum_j \left( \log V_i + \frac{C_i t_{ij}}{V_i} - \mu_i \right)^2 \Big/ \xi + (\log V_i - \gamma_i)^2 \big/ \zeta \right\} \right), \qquad (10.4)$$

where the hyperparameters $\mu_i$, $\xi$, $\gamma_i$, and $\zeta$ depend on the other parameters and
$x_{ij}$. Gilks et al. (1993) suggest using an Accept-Reject algorithm to simulate
from (10.4), by removing the $C_i t_{ij} / V_i$ terms. Another possibility is to use a
Metropolis-Hastings step, as described in Section 10.3.3. ||



For some cases of hierarchical models, it is possible to show that the
associated Gibbs chains are uniformly ergodic. Typically, this can only be
accomplished on a case-by-case basis, as in the following example. Schervish and
Carlin (1992) study the weaker property of geometric convergence of [A.40]
in some detail, which requires conditions on the kernel that can be difficult to
assess in practice.

Example 10.17. Nuclear pump failures. Gaver and O'Muircheartaigh
(1987) introduced a model that is frequently used (or even overused) in the
Gibbs sampling literature to illustrate various properties (see, for instance,
Gelfand and Smith 1990, Tanner 1996, or Guihenneuc-Jouyaux and Robert
1998). This model describes multiple failures of pumps in a nuclear plant, with
the data given in Table 10.1.

Pump       1      2      3      4       5     6      7     8     9     10
Failures   5      1      5      14      3     19     1     1     4     22
Time       94.32  15.72  62.88  125.76  5.24  31.44  1.05  1.05  2.10  10.48

Table 10.1. Numbers of failures and times of observation of 10 pumps in a nuclear
plant (Source: Gaver and O'Muircheartaigh 1987).

The modeling is based on the assumption that
the failures of the $i$th pump follow a Poisson process with parameter $\lambda_i$
($1 \le i \le 10$). For an observed time $t_i$, the number of failures $p_i$ is thus a Poisson
$\mathcal{P}(\lambda_i t_i)$ random variable. The associated prior distributions are ($1 \le i \le 10$)

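$$\lambda_i \sim \mathcal{G}a(\alpha, \beta), \qquad \beta \sim \mathcal{G}a(\gamma, \delta),$$

with $\alpha = 1.8$, $\gamma = 0.01$, and $\delta = 1$ (see Gaver and O'Muircheartaigh 1987 for
a motivation of these numerical values). The joint distribution is thus

$$\begin{aligned}
\pi(\lambda_1,\ldots,\lambda_{10},\beta | t_1,\ldots,t_{10},p_1,\ldots,p_{10})
&\propto \prod_{i=1}^{10} \left\{ (\lambda_i t_i)^{p_i} e^{-\lambda_i t_i}\, \lambda_i^{\alpha-1} e^{-\beta\lambda_i} \right\} \beta^{10\alpha} \beta^{\gamma-1} e^{-\delta\beta} \\
&\propto \prod_{i=1}^{10} \left\{ \lambda_i^{p_i+\alpha-1} e^{-(t_i+\beta)\lambda_i} \right\} \beta^{10\alpha+\gamma-1} e^{-\delta\beta}
\end{aligned}$$

and a natural decomposition^ of $\pi$ in conditional distributions is

$$\lambda_i | \beta, t_i, p_i \sim \mathcal{G}a(p_i + \alpha, t_i + \beta) \qquad (1 \le i \le 10),$$

$$\beta | \lambda_1,\ldots,\lambda_{10} \sim \mathcal{G}a\left( \gamma + 10\alpha,\ \delta + \sum_{i=1}^{10} \lambda_i \right).$$

The transition kernel on $\beta$ associated with [A.40] and this decomposition
satisfies (see Problem 10.3)

$$\begin{aligned}
K(\beta,\beta') &= \int \frac{(\beta')^{\gamma+10\alpha-1}}{\Gamma(10\alpha+\gamma)} \left( \delta + \sum_{i=1}^{10} \lambda_i \right)^{\gamma+10\alpha} \exp\left\{ -\beta' \left( \delta + \sum_{i=1}^{10} \lambda_i \right) \right\} \\
&\qquad \times \prod_{i=1}^{10} \frac{(t_i+\beta)^{p_i+\alpha}}{\Gamma(p_i+\alpha)}\, \lambda_i^{p_i+\alpha-1} \exp\{-(t_i+\beta)\lambda_i\}\, d\lambda_1 \cdots d\lambda_{10} \\
&\ge \frac{\delta^{\gamma+10\alpha} (\beta')^{\gamma+10\alpha-1}}{\Gamma(10\alpha+\gamma)}\, e^{-\delta\beta'} \prod_{i=1}^{10} \left( \frac{t_i}{t_i+\beta'} \right)^{p_i+\alpha}. \qquad (10.5)
\end{aligned}$$

This minorization by a positive quantity which does not depend on $\beta$ implies
that the entire space ($\mathbb{R}_+$) is a small set for the transition kernel; thus, by
Theorem 6.59, the chain $(\beta^{(t)})$ is uniformly ergodic (see Definition 6.58).

As shown in Section 9.2.3, uniform ergodicity directly extends to the dual
chain $\lambda^{(t)} = (\lambda_1^{(t)},\ldots,\lambda_{10}^{(t)})$. ||

^ This decomposition reflects the hierarchical structure of the model.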
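The conditional decomposition of this example gives a two-stage Gibbs sampler that is short enough to write out in full. The sketch below uses the data of Table 10.1 and the hyperparameter values $\alpha = 1.8$, $\gamma = 0.01$, $\delta = 1$ quoted in the example; the starting value of $\beta$, the burn-in length, and the number of iterations are illustrative assumptions. Because `random.gammavariate` is parameterized by shape and scale, each rate is inverted.

```python
import random

FAILURES = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]
TIMES = [94.32, 15.72, 62.88, 125.76, 5.24, 31.44, 1.05, 1.05, 2.10, 10.48]
ALPHA, GAMMA, DELTA = 1.8, 0.01, 1.0


def pump_gibbs(n_iter, burn_in, rng, beta=1.0):
    """Return posterior mean estimates of (lambda_1..lambda_10) and beta."""
    lam_sums = [0.0] * len(FAILURES)
    beta_sum = 0.0
    kept = 0
    for t in range(n_iter):
        # lambda_i | beta, t_i, p_i ~ Ga(p_i + alpha, t_i + beta)  (rate form)
        lam = [rng.gammavariate(p + ALPHA, 1.0 / (ti + beta))
               for p, ti in zip(FAILURES, TIMES)]
        # beta | lambda_1..lambda_10 ~ Ga(gamma + 10 alpha, delta + sum lambda_i)
        beta = rng.gammavariate(GAMMA + 10 * ALPHA, 1.0 / (DELTA + sum(lam)))
        if t >= burn_in:
            kept += 1
            beta_sum += beta
            for i, value in enumerate(lam):
                lam_sums[i] += value
    return [s / kept for s in lam_sums], beta_sum / kept


lambda_means, beta_mean = pump_gibbs(6000, 1000, random.Random(0))
```

Given the uniform ergodicity established through (10.5), the choice of starting value for $\beta$ matters little here; the burn-in is kept only as a conventional precaution.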

Note that Examples 10.15 and 10.17 are cases of two-stage Gibbs samplers,
which means that $(X^{(t)})$ and $(Z^{(t)})$ are interleaved Markov chains (Definition
9.10). As noted earlier, the interesting features of interleaving disappear for
$p \ge 3$, because the subchains $(Y_1^{(t)}),\ldots,(Y_p^{(t)})$ are not Markov chains,
although the vector $(Y^{(t)})$ is. Therefore, there is no transition kernel associated
with $(Y_i^{(t)})$, and the study of uniform ergodicity can only cover a grouped
vector of $(p-1)$ of the $p$ components of $(Y^{(t)})$ since the original kernel cannot,
in general, be bounded uniformly from below. If we denote $z_1 = (y_2,\ldots,y_p)$,
the transition from $z_1$ to $z_1'$ has the following kernel:

$$K_1(z_1,z_1') = \int_{\mathcal{Y}_1} g_1(y_1'|z_1)\, g_2(y_2'|y_1',y_3,\ldots,y_p) \cdots g_p(y_p'|y_1',y_2',\ldots,y_{p-1}')\, dy_1'. \qquad (10.6)$$

While some setups result in a uniform bound on $K_1(z_1,z_1')$, it is often
impossible to achieve a uniform minorization of $K$. For example, it is impossible
to bound the transition kernel of the autoexponential model (Example 10.1)
from below.



10.3 Hybrid Gibbs Samplers 

10.3.1 Comparison with Metropolis-Hastings Algorithms

A comparison between a Gibbs sampling method and an arbitrary
Metropolis-Hastings algorithm would seem a priori to favor Gibbs sampling since it
derives its conditional distributions from the true distribution $f$, whereas a
Metropolis-Hastings kernel is, at best, based on an approximation of this
distribution $f$. In particular, we have noted several times that Gibbs sampling
methods are, by construction, more straightforward than Metropolis-Hastings
methods, since they cannot have a “bad” choice of the instrumental
distribution and, hence, they avoid useless simulations (rejections). Although these
algorithms can be formally compared, seeking a ranking of these two main
types of MCMC algorithms is not only illusory but also somewhat pointless.

However, we do stress in this section that the availability and apparent
objectivity of Gibbs sampling methods are not necessarily compelling arguments.
If we consider the Gibbs sampler of Theorem 10.13, the Metropolis-Hastings
algorithms which underlie this method are not valid on an individual basis
since they do not produce irreducible Markov chains $(Y^{(t)})$. Therefore, only
a combination of a sufficient number of Metropolis-Hastings algorithms can
ensure the validity of the Gibbs sampler. This composite structure is also a
weakness of the method, since a decomposition of the joint distribution $f$
given a particular system of coordinates does not necessarily agree with the
form of $f$. Example 10.7 illustrates this incompatibility in a pathological case:
A wrong choice of the coordinates traps the corresponding Gibbs sampler in
one of the two connected components of the support of $f$. Hills and Smith
(1992, 1993) also propose examples where an incorrect parameterization of
the model significantly increases the convergence time for the Gibbs sampler.
As seen in Note 9.7.1 in the particular case of mixtures of distributions,
parameterization influences the performance of the Gibbs sampler to the extent
of getting into a trapping state.

To draw an analogy, let us recall that when a function $v(y_1,\ldots,y_p)$ is
maximized one component at a time, the resulting solution is not always
satisfactory since it may correspond to a saddlepoint or to a local maximum
of $v$. Similarly, the simulation of a single component at each iteration of [A.40]
restricts the possible excursions of the chain $(Y^{(t)})$ and this implies that Gibbs
sampling methods are generally slow to converge, since they are slow to explore
the surface of $f$.

This intrinsic defect of the Gibbs sampler leads to phenomena akin to
convergence to local maxima in optimization algorithms, which are expressed by
strong attractions to the closest local modes and, in consequence, to difficulties
in exploring the entire range of the support of $f$.

Example 10.18. Two-dimensional mixture. Consider a two-dimensional
mixture of normal distributions,

$$p_1 \mathcal{N}_2(\mu_1, \Sigma_1) + p_2 \mathcal{N}_2(\mu_2, \Sigma_2) + p_3 \mathcal{N}_2(\mu_3, \Sigma_3), \qquad (10.7)$$

given in Figure 10.3 as a gray-level image.^ Both unidimensional conditionals
are also mixtures of normal distributions and lead to a straightforward Gibbs
sampler. The first 100 steps of the associated Gibbs sampler are represented
in Figure 10.3; they show mostly slow moves along the first two components
of the mixture and a single attempt to reach the third component, which is
too far in the tail of the conditional. Note that the numerical values chosen
for this illustration are such that the third component has a 31% probability
mass in (10.7). (Each step of the Gibbs sampler is given in the graph, which
explains the succession of horizontal and vertical moves.) ||
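Since each full conditional of (10.7) is itself a univariate normal mixture, the Gibbs sampler of this example can be sketched in a few lines. The numerical values below are hypothetical: only the 31% weight of the third component is taken from the text, while the other weights, the means, and the covariance entries are placeholders chosen so that the third component sits far from the first two.

```python
import math
import random

# Hypothetical components: (weight, mean, covariance entries (s11, s12, s22)).
COMPONENTS = [
    (0.36, (0.0, 0.0), (1.0, 0.3, 1.0)),
    (0.33, (2.5, 2.5), (1.0, -0.2, 1.0)),
    (0.31, (9.0, 9.0), (1.0, 0.0, 1.0)),
]


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def draw_conditional(fixed, which, rng):
    """Draw coordinate `which` (0 or 1) given that the other one equals `fixed`."""
    weights, params = [], []
    for p, mu, (s11, s12, s22) in COMPONENTS:
        if which == 0:
            m_free, m_fix, v_free, v_fix = mu[0], mu[1], s11, s22
        else:
            m_free, m_fix, v_free, v_fix = mu[1], mu[0], s22, s11
        # Conditional weight: prior weight times marginal density of `fixed`.
        weights.append(p * normal_pdf(fixed, m_fix, v_fix))
        # Conditional normal within the component.
        params.append((m_free + s12 / v_fix * (fixed - m_fix),
                       v_free - s12 ** 2 / v_fix))
    k = rng.choices(range(len(COMPONENTS)), weights=weights)[0]
    mean, var = params[k]
    return rng.gauss(mean, math.sqrt(var))


def gibbs_mixture(n_iter, start, rng):
    """Return the successive states of the Gibbs chain."""
    y1, y2 = start
    path = []
    for _ in range(n_iter):
        y1 = draw_conditional(y2, 0, rng)
        y2 = draw_conditional(y1, 1, rng)
        path.append((y1, y2))
    return path


path = gibbs_mixture(100, (0.0, 0.0), random.Random(3))
```

With these placeholder values the conditional weight of the distant component is tiny whenever the chain sits near the first two components, which is the mechanism behind the slow exploration described above.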



10.3.2 Mixtures and Cycles 

The drawbacks of Metropolis-Hastings algorithms are different from those of
the Gibbs sampler, as they are more often related to a bad agreement between
$f$ and the instrumental distribution. Moreover, the freedom brought by
Metropolis-Hastings methods sometimes allows for remedies to these drawbacks
through the modification of some scale (parameters or hyperparameters are
particularly useful).

^ As opposed to the other mixture examples, this example considers simulation
from (10.7), which is a more artificial setting, given that direct simulation is
straightforward.

Fig. 10.3. Successive (full conditional) moves of the Gibbs chain on the surface of
the stationary distribution (10.7), represented by gray levels (darker shades mean
higher elevations).

Compared to the Gibbs sampler, a failing of Metropolis-Hastings
algorithms is to miss the finer details of the distribution $f$, if the simulation is
“too coarse.” However, following Tierney (1994), a way to take advantage of
both algorithms is to implement a hybrid approach which uses both Gibbs
sampling and Metropolis-Hastings algorithms.

Definition 10.19. A hybrid MCMC algorithm is a Markov chain Monte
Carlo method which simultaneously utilizes both Gibbs sampling steps and
Metropolis-Hastings steps. If $K_1,\ldots,K_n$ are kernels which correspond to
these different steps and if $(\alpha_1,\ldots,\alpha_n)$ is a probability distribution, a mixture
of $K_1, K_2, \ldots, K_n$ is an algorithm associated with the kernel

$$K = \alpha_1 K_1 + \cdots + \alpha_n K_n$$

and a cycle of $K_1, K_2, \ldots, K_n$ is the algorithm with kernel

$$K^* = K_1 \circ \cdots \circ K_n,$$

where “$\circ$” denotes the composition of functions.
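In code, the two constructions of this definition amount to two small combinators acting on transition functions. The sketch below is only an illustration: kernels are represented as functions mapping a state to a new state, the component kernels are random-walk Metropolis steps for a standard normal target, and all names and tuning values are hypothetical. The `cycle` helper applies the kernels in the order in which they are listed, which is a convention of this sketch.

```python
import math
import random


def random_walk_kernel(log_target, scale, rng):
    """Random-walk Metropolis kernel with a normal proposal of given scale."""
    def step(x):
        proposal = x + rng.gauss(0.0, scale)
        accept = math.exp(min(0.0, log_target(proposal) - log_target(x)))
        return proposal if rng.random() < accept else x
    return step


def mixture(kernels, weights, rng):
    """At each iteration, pick one kernel at random with the given weights."""
    def step(x):
        chosen = rng.choices(kernels, weights=weights)[0]
        return chosen(x)
    return step


def cycle(kernels):
    """At each iteration, apply every kernel once, in the listed order."""
    def step(x):
        for kernel in kernels:
            x = kernel(x)
        return x
    return step


def run(kernel, start, n_iter):
    x, out = start, []
    for _ in range(n_iter):
        x = kernel(x)
        out.append(x)
    return out


def log_target(x):
    return -0.5 * x * x          # standard normal, up to a constant


rng = random.Random(2)
local = random_walk_kernel(log_target, 0.3, rng)   # small, frequent moves
wide = random_walk_kernel(log_target, 3.0, rng)    # occasional large moves
mixture_sample = run(mixture([local, wide], [0.9, 0.1], rng), 0.0, 20000)
cycle_sample = run(cycle([local, wide]), 0.0, 20000)
```

Both combinators preserve the target whenever each component kernel does, which is the property used in the rest of this section.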

Of course, this definition is somewhat ambiguous since Theorem 10.13
states that the Gibbs sampler is already a composition of Metropolis-Hastings
kernels; that is, a cycle according to the above definition. Definition 10.19
must therefore be understood as processing heterogeneous MCMC algorithms,
where the chain under study is a subchain of the chain produced by
the algorithm.

From our initial perspective concerning the speed of convergence of the
Gibbs sampler, Definition 10.19 leads us to consider modifications of the initial
algorithm where, every $m$ iterations, [A.40] is replaced by a
Metropolis-Hastings step with larger dispersion or, alternatively, at each iteration, this
Metropolis-Hastings step is selected with probability $1/m$. These modifications
are particularly helpful to escape “trapping effects” related to local modes of
$f$.

Hybrid procedures are valid from a theoretical point of view, when the
heterogeneity of the chains generated by cycles is removed by considering only
the subchains (although the entire chain should be exploited). A
composition of kernels associated with an identical stationary distribution $f$
leads to a kernel with stationary distribution $f$. The irreducibility and
aperiodicity of the mixture $K$ follow directly from the irreducibility and the
aperiodicity of one of the kernels $K_i$. In the case of a cycle, we already saw that the
irreducibility of $K^*$ for the Gibbs sampler does not require the irreducibility
of its component kernels and a specific study of the algorithm at hand may
sometimes be necessary. Tierney (1994) also proposes sufficient conditions for
uniform ergodicity.

Proposition 10.20. If $K_1$ and $K_2$ are two kernels with the same
stationary distribution $f$ and if $K_1$ produces a uniformly ergodic Markov chain, the
mixture kernel

$$K = \alpha K_1 + (1-\alpha) K_2 \qquad (0 < \alpha < 1)$$

is also uniformly ergodic. Moreover, if $\mathcal{X}$ is a small set for $K_1$ with $m = 1$,
the kernel cycles $K_1 \circ K_2$ and $K_2 \circ K_1$ are uniformly ergodic.

Proof. If $K_1$ produces a uniformly ergodic Markov chain, there exist $m \in \mathbb{N}$,
$\varepsilon_m > 0$, and a probability measure $\nu_m$ such that $K_1$ satisfies

$$K_1^m(x,A) \ge \varepsilon_m \nu_m(A), \qquad \forall x \in \mathcal{X},\ \forall A \in \mathcal{B}(\mathcal{X}).$$

Therefore, we have the minorization condition

$$(\alpha K_1 + (1-\alpha) K_2)^m(x,A) \ge \alpha^m K_1^m(x,A) \ge \alpha^m \varepsilon_m \nu_m(A),$$

which, from Theorem 6.59, establishes the uniform ergodicity of the mixture
kernel.

If $\mathcal{X}$ is a small set for $K_1$ with $m = 1$, we have the minorizations

$$\begin{aligned}
(K_1 \circ K_2)(x,A) &= \int_A \int_{\mathcal{X}} K_2(x,dy)\, K_1(y,dz) \\
&\ge \varepsilon_1 \nu_1(A) \int_{\mathcal{X}} K_2(x,dy) = \varepsilon_1 \nu_1(A)
\end{aligned}$$

and

$$\begin{aligned}
(K_2 \circ K_1)(x,A) &= \int_A \int_{\mathcal{X}} K_1(x,dy)\, K_2(y,dz) \\
&\ge \varepsilon_1 \int_A \int_{\mathcal{X}} \nu_1(dy)\, K_2(y,dz) = \varepsilon_1 (K_2 \circ \nu_1)(A).
\end{aligned}$$

From Theorem 6.59, both cycles are therefore uniformly ergodic. □
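On a finite state space the minorizations in this proof reduce to entrywise inequalities between matrices, which makes them easy to check. The sketch below is an illustration under stated assumptions: the distribution `F`, the constant `EPS`, the weight `ALPHA`, and both kernels are hypothetical. `K1` redraws from `F` with probability `EPS` and otherwise stays put, so it satisfies the one-step minorization with $\nu_1 = f$; `K2` is a Metropolis kernel with a uniform proposal on the other states.

```python
F = [0.2, 0.3, 0.5]   # hypothetical stationary distribution
EPS = 0.5             # minorization constant of K1 (m = 1, nu_1 = F)
ALPHA = 0.3           # mixture weight on K1
N = len(F)


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


def metropolis_kernel(f):
    """Metropolis kernel for f with a uniform proposal on the other states."""
    k = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            if i != j:
                k[i][j] = min(1.0, f[j] / f[i]) / (N - 1)
        k[i][i] = 1.0 - sum(k[i])
    return k


def is_minorized(kernel, eps, nu, tol=1e-12):
    """Check kernel(x, {z}) >= eps * nu({z}) for every pair (x, z)."""
    return all(kernel[i][j] >= eps * nu[j] - tol
               for i in range(N) for j in range(N))


def is_stationary(kernel, f, tol=1e-12):
    return all(abs(sum(f[i] * kernel[i][j] for i in range(N)) - f[j]) < tol
               for j in range(N))


K1 = [[EPS * F[j] + (1.0 - EPS) * (i == j) for j in range(N)] for i in range(N)]
K2 = metropolis_kernel(F)
MIXTURE = [[ALPHA * K1[i][j] + (1.0 - ALPHA) * K2[i][j] for j in range(N)]
           for i in range(N)]
# With the convention of the proof, (K1 o K2)(x, A) applies K2 first.
CYCLE_12 = matmul(K2, K1)
CYCLE_21 = matmul(K1, K2)
```

Because $\nu_1 = f$ is stationary for `K2` in this toy setting, the bound $\varepsilon_1 (K_2 \circ \nu_1)$ for the second cycle is again $\varepsilon_1 f$.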

These results are not only formal since it is possible (see Theorem 7.8) to 
produce a uniformly ergodic kernel from an independent Metropolis-Hastings 
algorithm with instrumental distribution g such that f/g is bounded. Hybrid 
MCMC algorithms can therefore be used to impose uniform ergodicity in an 
almost automatic way. 

The following example, due to Nobile (1998), shows rather clearly how the 
introduction of a Metropolis-Hastings step in the algorithm speeds up the 
exploration of the support of the stationary distribution. 

Example 10.21. Probit model. A (dichotomous) probit model is defined
by random variables $D_i$ ($1 \le i \le n$) such that

$$P(D_i = 1) = 1 - P(D_i = 0) = P(Z_i \ge 0) \qquad (10.8)$$

with $Z_i \sim \mathcal{N}(R_i\beta, \sigma^2)$, $\beta \in \mathbb{R}$, $R_i$ being a covariate. (Note that the $Z_i$'s are
not observed. This is a special case of a latent variable model, as mentioned
in Sections 1.1 and 5.3.1. See also Problem 10.14.) For the prior distribution

$$\sigma^{-2} \sim \mathcal{G}a(1.5, 1.5), \qquad \beta|\sigma \sim \mathcal{N}(0, 10^2),$$

Figure 10.4 plots the first 20,000 iterations of the Gibbs chain $(\beta^{(t)}, \sigma^{(t)})$
against some contours of the true posterior distribution. (See Example 12.1
for further details.) The exploration is thus very poor, since the chain does not
even reach the region of highest posterior density. (A reason for this behavior
is that the likelihood is quite uninformative about $(\beta, \sigma)$, providing only a
lower bound on $\beta/\sigma$, as explained in Nobile 1998. See also Problem 10.14.)

A hybrid alternative, proposed by Nobile (1998), is to insert a
Metropolis-Hastings step after each Gibbs cycle. The proposal distribution merely
rescales the current value of the Markov chain $y^{(t)}$ by a random scale factor $c$,
drawn from an exponential $\mathcal{E}xp(1)$ distribution (which is similar to the
“hit-and-run” method of Chen and Schmeiser 1993). The rescaled value $c\,y^{(t)}$ is
then accepted or rejected by a regular Metropolis-Hastings scheme to become
the current value of the chain. Figure 10.5 shows the improvement brought by
this hybrid scheme, since the MCMC sample now covers most of the support
of the posterior distribution for the same number of iterations as in Figure
10.4. ||

Fig. 10.4. Sample of $(\beta^{(t)}, \sigma^{(t)})$ obtained by the Gibbs sampler plotted with some
contours of the posterior distribution of $(\beta,\sigma)$ for 20,000 observations from (10.8),
when the chain is started at $(\beta, \sigma) = (25, 5)$.

Fig. 10.5. Sample of $(\beta^{(t)}, \sigma^{(t)})$ obtained by a hybrid MCMC algorithm plotted
with the posterior distribution for the same 20,000 observations as in Figure 10.4
and the same starting point.



10.3.3 Metropolizing the Gibbs Sampler 

Hybrid MCMC algorithms are often useful at an elementary level of the
simulation process; that is, when some components of [A.40] cannot be easily
simulated. Rather than looking for a customized algorithm such as
Accept-Reject in each of these cases or for alternatives to Gibbs sampling, there is a
compromise suggested by Müller (1991, 1993) (sometimes called
“Metropolis-within-Gibbs”). In any step $i$ of the algorithm [A.40] with a difficult
simulation from $g_i(y_i|y_j, j \neq i)$, substitute a simulation from an instrumental
distribution $q_i$. In the setup of [A.40], Müller's (1991) modification is



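Algorithm A.43 -Hybrid MCMC-

For $i = 1,\ldots,p$, given $(y_1^{(t+1)},\ldots,y_{i-1}^{(t+1)},y_i^{(t)},\ldots,y_p^{(t)})$:

1. Simulate

$$\tilde y_i \sim q_i(y_i | y_1^{(t+1)},\ldots,y_i^{(t)},y_{i+1}^{(t)},\ldots,y_p^{(t)})$$

2. Take

$$y_i^{(t+1)} = \begin{cases} y_i^{(t)} & \text{with probability } 1-\rho, \\ \tilde y_i & \text{with probability } \rho, \end{cases} \qquad \text{[A.43]}$$

where

$$\rho = 1 \wedge \left\{ \frac{ g_i(\tilde y_i | y_1^{(t+1)},\ldots,y_{i-1}^{(t+1)},y_{i+1}^{(t)},\ldots,y_p^{(t)}) \Big/ q_i(\tilde y_i | y_1^{(t+1)},\ldots,y_{i-1}^{(t+1)},y_i^{(t)},y_{i+1}^{(t)},\ldots,y_p^{(t)}) }{ g_i(y_i^{(t)} | y_1^{(t+1)},\ldots,y_{i-1}^{(t+1)},y_{i+1}^{(t)},\ldots,y_p^{(t)}) \Big/ q_i(y_i^{(t)} | y_1^{(t+1)},\ldots,y_{i-1}^{(t+1)},\tilde y_i,y_{i+1}^{(t)},\ldots,y_p^{(t)}) } \right\}.$$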
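A minimal sketch of [A.43] follows, under two simplifying assumptions that are not part of the algorithm as stated: every component is updated by a Metropolis-Hastings step, and each $q_i$ is a symmetric normal random walk around the current value, so that the $q_i$ terms cancel in $\rho$. Since $g_i$ is proportional to $g$ as a function of $y_i$, the ratio of conditionals is computed as a ratio of joint densities. The correlated bivariate normal target and all tuning values are hypothetical.

```python
import math
import random

RHO = 0.8  # hypothetical correlation of the toy bivariate normal target


def log_g(y):
    """Log-density of the toy target, up to an additive constant."""
    y1, y2 = y
    return -(y1 * y1 - 2.0 * RHO * y1 * y2 + y2 * y2) / (2.0 * (1.0 - RHO ** 2))


def metropolis_within_gibbs(log_target, start, n_sweeps, scale, rng):
    """One Metropolis-Hastings step per component and per sweep."""
    y = list(start)
    chain = []
    for _ in range(n_sweeps):
        for i in range(len(y)):
            proposal = list(y)
            proposal[i] = y[i] + rng.gauss(0.0, scale)   # symmetric q_i
            # g_i(new | rest) / g_i(old | rest) equals g(new) / g(old).
            accept = math.exp(min(0.0, log_target(proposal) - log_target(y)))
            if rng.random() < accept:
                y = proposal                             # keep the new value
            # otherwise y_i stays at its current value
        chain.append(tuple(y))
    return chain


chain = metropolis_within_gibbs(log_g, (0.0, 0.0), 20000, 1.0, random.Random(1))
```

As in [A.43], each component receives a single proposal per sweep; no inner loop tries to approximate the conditional more closely.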



An important point about this substitution is that the above
Metropolis-Hastings step is only used once in an iteration from [A.40]. The modified
step thus produces a single simulation $\tilde y_i$ instead of trying to approximate
$g_i(y_i|y_j, j \neq i)$ more accurately by producing $T$ simulations from $q_i$. The
reasons for this choice are twofold: First, the resulting hybrid algorithm is
valid since $g$ is its stationary distribution (see Section 10.3.2 or Problems
10.13 and 10.15). Second, Gibbs sampling also leads to an approximation of
$g$. To provide a more “precise” approximation of $g_i(y_i|y_j, j \neq i)$ in [A.43] does
not necessarily lead to a better approximation of $g$ and the replacement of $g_i$
by $q_i$ may even be beneficial for the speed of excursion of the chain on the
surface of $g$.^ (See also Chen and Schmeiser 1998.)

When several Metropolis-Hastings steps appear in a Gibbs algorithm,
either because of some complex conditional distributions (see Besag et al. 1995)
or because of convergence contingencies, a method proposed by Müller (1993)
is to run a single acceptance step after the $p$ conditional simulations. This
approach is more time-consuming in terms of simulation (since it may result
in the rejection of the $p$ simulated components), but the resulting algorithm
can be written as a simple Metropolis-Hastings method instead of a (hybrid)
combination of Metropolis-Hastings algorithms. Moreover, it produces an
approximation $q(y)$ of the distribution $g(y)$ rather than local approximations of
the conditional distributions of the subvectors $y_i$.

^ The multi-stage Gibbs sampler is itself an illustration of this, as announced in
Section 9.5.

We conclude this section with a surprising result of Liu (1995), who shows
that, in a discrete state setting, Gibbs sampling can be improved by
Metropolis-Hastings steps. The improvement is expressed in terms of a reduction of
the variance of the empirical mean of the $h(y^{(t)})$'s for the Metropolis-Hastings
approach. This modification, called Metropolization by Liu (1995), is based on
the following result, established by Peskun (1973), where $T_1 \ll T_2$ means that
the non-diagonal elements of $T_2$ are larger than those of $T_1$ (see Problem 7.39
for a proof).

Lemma 10.22. Consider two reversible Markov chains on a countable
state-space, with transition matrices $T_1$ and $T_2$ such that $T_1 \ll T_2$. The chain
associated with $T_2$ dominates the chain associated with $T_1$ in terms of variances.

Given a conditional distribution $g_i(y_i|y_j, j \neq i)$ on a discrete space, the
modification proposed by Liu (1995) is to use an additional Metropolis-Hastings
step.



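Algorithm A.44 -Metropolization of the Gibbs Sampler-

Given $y^{(t)}$,

1. Simulate $z_i \neq y_i^{(t)}$ with probability

   $$\frac{g_i(z_i|y_j^{(t)}, j \neq i)}{1 - g_i(y_i^{(t)}|y_j^{(t)}, j \neq i)} ;$$

2. Accept $y_i^{(t+1)} = z_i$ with probability

   $$\frac{1 - g_i(y_i^{(t)}|y_j^{(t)}, j \neq i)}{1 - g_i(z_i|y_j^{(t)}, j \neq i)} \wedge 1 .$$   [A.44]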

The probability of moving from $y^{(t)}$ to a different value is then necessarily
higher in [A.44] than in the original Gibbs sampling algorithm and Lemma
10.22 implies the following domination result.

Theorem 10.23. The modification [A.44] of the Gibbs sampler is more
efficient in terms of variances.
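
As an illustration only, the two steps of [A.44] can be sketched for a single discrete component whose full conditional is supplied as a probability vector. The function name `metropolized_gibbs_step` and its arguments are hypothetical choices made for this sketch; the full conditional `probs` is assumed to be available in normalized form.

```python
import numpy as np

def metropolized_gibbs_step(current, probs, rng):
    """One Metropolized update of a single discrete component.

    current : index of the current value y_i^(t)
    probs   : full conditional g_i(. | y_j, j != i), a vector summing to 1
    rng     : a numpy random Generator
    """
    probs = np.asarray(probs, dtype=float)
    p_cur = probs[current]
    if p_cur >= 1.0:
        # The conditional is a point mass: there is no other value to propose.
        return current
    # Step 1: propose z != current with probability g_i(z) / (1 - g_i(current)).
    proposal = probs.copy()
    proposal[current] = 0.0
    proposal /= proposal.sum()
    z = rng.choice(len(probs), p=proposal)
    # Step 2: accept with probability min{1, (1 - g_i(current)) / (1 - g_i(z))}.
    if probs[z] >= 1.0:
        accept = 1.0
    else:
        accept = min(1.0, (1.0 - p_cur) / (1.0 - probs[z]))
    return z if rng.random() < accept else current
```

For a fixed conditional, iterating this step leaves `probs` invariant while moving away from the current value more often than a plain draw from `probs` would, which is the mechanism behind Theorem 10.23.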



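Example 10.24. (Continuation of Example 9.8) For the aggregated
multinomial model of Tanner and Wong (1987), the completed variables $z_i$
$(i = 1, \ldots, 4)$ take values in $\{0, 1, \ldots, x_i\}$ and can, therefore, be simulated
from [A.44], that is, from a binomial distribution

$$\mathcal{B}\left(x_i, \frac{a_i\mu}{a_i\mu + b_i}\right) \text{ for } i = 1, 2 \quad\text{and}\quad \mathcal{B}\left(x_i, \frac{a_i\eta}{a_i\eta + b_i}\right) \text{ for } i = 3, 4,$$

until the simulated value $\tilde z_i$ is different from $z_i^{(t)}$. This new value $\tilde z_i$ is accepted
with probability $(i = 1, 2)$

$$\frac{1 - \binom{x_i}{z_i^{(t)}} \dfrac{(a_i\mu)^{z_i^{(t)}}\, b_i^{x_i - z_i^{(t)}}}{(a_i\mu + b_i)^{x_i}}}{1 - \binom{x_i}{\tilde z_i} \dfrac{(a_i\mu)^{\tilde z_i}\, b_i^{x_i - \tilde z_i}}{(a_i\mu + b_i)^{x_i}}} \wedge 1$$

and $(i = 3, 4)$

$$\frac{1 - \binom{x_i}{z_i^{(t)}} \dfrac{(a_i\eta)^{z_i^{(t)}}\, b_i^{x_i - z_i^{(t)}}}{(a_i\eta + b_i)^{x_i}}}{1 - \binom{x_i}{\tilde z_i} \dfrac{(a_i\eta)^{\tilde z_i}\, b_i^{x_i - \tilde z_i}}{(a_i\eta + b_i)^{x_i}}} \wedge 1 .$$

Fig. 10.6. Comparison of the Gibbs sampling (left) and of the modification of
Liu (1995) (right), for the estimation of $\mathbb{E}[\mu|x]$ (top) and of $\mathbb{E}[\eta|x]$ (bottom). The
90% confidence regions on both methods have been obtained on 500 parallel chains.
(Horizontal axes: iterations, from 0 to 10,000.)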



Figure 10.6 describes the convergence of estimations of $\mu$ and $\eta$ under the two
simulation schemes for the following data:

$(a_1, a_2, a_3, a_4) = (0.06, 0.14, 0.11, 0.09)$,
$(b_1, b_2, b_3, b_4) = (0.17, 0.24, 0.19, 0.20)$,
$(x_1, x_2, x_3, x_4, x_5) = (9, 15, 12, 7, 8)$.

          n          10      100      1000     10,000

  mu   (Gibbs)     0.302   0.0984   0.0341   0.0102
       (Liu)       0.288   0.0998   0.0312   0.0104
  eta  (Gibbs)     0.234   0.0803   0.0274   0.00787
       (Liu)       0.234   0.0803   0.0274   0.00778

Table 10.2. 90% interquantile ranges for the original Gibbs sampling (top) and
the modification of Liu (1995) (bottom), for the estimation of $\mathbb{E}[\mu|x]$ and $\mathbb{E}[\eta|x]$.



As shown by Table 10.2, the differences in the variation ranges of the
approximations of both $\mathbb{E}[\mu|x]$ and $\mathbb{E}[\eta|x]$ are quite minor, in particular for the
estimation of $\eta$, and the ranges are not always larger for the original Gibbs sampler. ||



10.4 Statistical Considerations 

10.4.1 Reparameterization 

Convergence of both Gibbs sampling and Metropolis-Hastings algorithms may 
suffer from a poor choice of parameterization (see the extreme case of Exam- 
ple 10.7). As a result of this, following Hills and Smith (1992), the MCMC 
literature has considered changes in the parameterization of a model as a way 
to speed up convergence in a Gibbs sampler. (See Gilks and Roberts 1996, 
for a more detailed review.) It seems, however, that most efforts have concen- 
trated on the improvement of specific models, resulting in a lack of a general 
methodology for the choice of a “proper” parameterization. Nevertheless, the 
overall advice is to try to make the components “as independent as possible.” 

As noted in Example 10.18, convergence performances of the Gibbs sampler
may be greatly affected by the choice of the coordinates. For instance,
if the distribution $g$ is a $\mathcal{N}_2(0, \Sigma)$ distribution with a covariance matrix $\Sigma$
such that its eigenvalues satisfy $\lambda_{\min}(\Sigma) \ll \lambda_{\max}(\Sigma)$ and its eigenvectors
correspond to the first and second diagonals of $\mathbb{R}^2$, the Gibbs sampler based
on the conditionals $g(x_1|x_2)$ and $g(x_2|x_1)$ is very slow to explore the entire
range of the support of $g$. However, if $y_1 = x_1 + x_2$ and $y_2 = x_1 - x_2$ is the
selected parameterization (which corresponds to the coordinates in the
eigenvector basis) the Gibbs sampler will move quite rapidly over the support of $g$.
(Hobert et al. 1997 have shown that the influence of the parameterization on
convergence performances may be so drastic that the chain can be irreducible
for some parameterizations and not irreducible for others.)
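
The bivariate normal case can be checked numerically. The sketch below assumes, for illustration, unit variances and a correlation `rho` close to 1, so that the eigenvectors of the covariance matrix lie on the two diagonals; the function names are hypothetical. In the rotated coordinates the two components are independent, so each Gibbs conditional reduces to the marginal distribution.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, rng):
    """Two-stage Gibbs sampler for N_2(0, [[1, rho], [rho, 1]]) in (x1, x2)."""
    x = np.zeros((n_iter, 2))
    x1, x2 = 0.0, 0.0
    sd = np.sqrt(1.0 - rho**2)
    for t in range(n_iter):
        x1 = rng.normal(rho * x2, sd)   # X1 | x2 ~ N(rho x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, sd)   # X2 | x1 ~ N(rho x1, 1 - rho^2)
        x[t] = x1, x2
    return x

def gibbs_rotated(rho, n_iter, rng):
    """Gibbs sampler in y1 = x1 + x2, y2 = x1 - x2, mapped back to (x1, x2)."""
    # y1 and y2 are independent, so each conditional is simply the marginal.
    y1 = rng.normal(0.0, np.sqrt(2 * (1 + rho)), n_iter)  # var(x1 + x2) = 2(1 + rho)
    y2 = rng.normal(0.0, np.sqrt(2 * (1 - rho)), n_iter)  # var(x1 - x2) = 2(1 - rho)
    return np.column_stack(((y1 + y2) / 2, (y1 - y2) / 2))

def lag1_autocorr(v):
    """Empirical lag-1 autocorrelation of a scalar sequence."""
    v = v - v.mean()
    return float(np.dot(v[:-1], v[1:]) / np.dot(v, v))
```

With `rho = 0.99`, the lag-1 autocorrelation of the first coordinate is close to `rho**2` for the original parameterization and close to 0 for the rotated one, which gives a rough numerical reading of "very slow" versus "quite rapidly."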

Similarly, the geometry of the selected instrumental distribution in a
Metropolis-Hastings algorithm must somehow correspond to the geometry of
the support of $g$ for good acceptance rates to be obtained. In particular, as
pointed out by Hills and Smith (1992), if a normal second-order approximation
to the density $g$ is used, the structure of the Hessian matrix will matter in
a Gibbs implementation, whereas the simplification brought by a Laplace
approximation must be weighed against the influence of the parameterization.
For instance, the value of the approximation based on the Taylor expansion

$$h(\theta) \approx h(\hat\theta) + \frac{1}{2}(\theta - \hat\theta)^t\, (\nabla\nabla^t h)\big|_{\theta = \hat\theta}\, (\theta - \hat\theta) ,$$

with $h(\theta) = \log g(\theta)$, depends on the choice of the parameterization.

Gelfand and Sahu (1994), Gelfand et al. (1995, 1996) have studied the 
effects of parameterization on different linear models. 

Example 10.25. Random effects model. Consider the simple random
effects model

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij} , \qquad i = 1, \ldots, I, \quad j = 1, \ldots, J,$$

where $\alpha_i \sim \mathcal{N}(0, \sigma_\alpha^2)$ and $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_y^2)$. For a flat prior on $\mu$, the Gibbs
sampler implemented for the $(\mu, \alpha_1, \ldots, \alpha_I)$ parameterization exhibits high
correlation and consequent slow convergence if $\sigma_y^2/(IJ\sigma_\alpha^2)$ is large (see
Problem 10.22). On the other hand, if the model is rewritten as the hierarchy

$$Y_{ij} \sim \mathcal{N}(\eta_i, \sigma_y^2), \qquad \eta_i \sim \mathcal{N}(\mu, \sigma_\alpha^2) ,$$

the correlations between the $\eta_i$'s and between $\mu$ and the $\eta_i$'s are lower (see
Problem 10.23).

Another approach is suggested by Vines and Gilks (1994) and Vines et al.
(1995), who eliminate the unidentifiability feature by so-called sweeping; that
is, by writing the model as

$$Y_{ij} = \nu + \varphi_i + \varepsilon_{ij} ,$$

with $(1 \le i \le I,\ 1 \le j \le J)$

$$\sum_i \varphi_i = 0 , \qquad \varphi_i \sim \mathcal{N}(0, \sigma_\alpha^2(1 - (1/I))) ,$$

and $\mathrm{cov}(\varphi_i, \varphi_j) = -\sigma_\alpha^2/I$. This choice leads to even better correlation
structures since the parameter $\nu$ is independent of the $\varphi_i$'s while
$\mathrm{corr}(\varphi_i, \varphi_j) = -1/I$, a posteriori. ||

The end of this section deals with a special feature of mixture estimation 
(see Example 9.2), whose relevance to MCMC algorithms is to show that a 
change in the parameterization of a model can accelerate convergence. 

As a starting point, consider the paradox of trapping states, discussed in
Section 9.7.1. We can reassess the relevance of the selected prior distribution,
which may not endow the posterior distribution with sufficient stability. The
conjugate distribution used in Example 9.2 ignores the specific nature of the
mixture (9.5) since it only describes the completed model, and creates an
independence between the components of the mixture. An alternative proposed
in Mengersen and Robert (1996) is to relate the various components through
a common reference, namely a scale, location, or scale and location parameter.

Example 10.26. (Continuation of Example 9.2) For instance, normal
distributions are location-scale distributions. Each component can thus be
expressed in terms of divergence from a global location, $\mu$, and a global scale,
$\tau$. Using somewhat abusive (but convenient) notation, we write

(10.9)   $$\sum_{j=1}^{k} p_j\, \mathcal{N}(\mu + \tau\theta_j ,\ \tau^2\sigma_j^2),$$

also requiring that $\theta_1 = 0$ and $\sigma_1 = 1$, which avoids overparameterization.
Although the model (10.9) is an improvement over (9.5), it is still unable to
ensure proper stability of the algorithm.^8 The solution proposed in Robert
and Mengersen (1999) is, instead, to express each component as a
perturbation of the previous component, which means using the location and scale
parameters of the previous component as a local reference. Starting from the
normal distribution $\mathcal{N}(\mu, \tau^2)$, the two-component mixture

$$p\,\mathcal{N}(\mu, \tau^2) + (1 - p)\,\mathcal{N}(\mu + \tau\theta,\ \tau^2\sigma^2)$$

is thus modified into

$$p\,\mathcal{N}(\mu, \tau^2) + (1 - p)\,q\,\mathcal{N}(\mu + \tau\theta,\ \tau^2\sigma^2) + (1 - p)(1 - q)\,\mathcal{N}(\mu + \tau\theta + \tau\sigma\varepsilon,\ \tau^2\sigma^2\omega^2) ,$$

a three-component mixture.

The $k$ component version of the reparameterization of (9.5) is

$$p\,\mathcal{N}(\mu, \tau^2) + (1 - p)\cdots(1 - q_{k-2})\,\mathcal{N}(\mu + \tau\theta_1 + \cdots + \tau\sigma_1\cdots\sigma_{k-2}\theta_{k-1},\ \tau^2\sigma_1^2\cdots\sigma_{k-1}^2)$$

(10.10)   $$+ \sum_{j=1}^{k-2} (1 - p)\cdots(1 - q_{j-1})\,q_j\,\mathcal{N}(\mu + \tau\theta_1 + \cdots + \tau\sigma_1\cdots\sigma_{j-1}\theta_j,\ \tau^2\sigma_1^2\cdots\sigma_j^2) .$$

The mixture (10.10) being invariant under permutations of indices, we can
impose an identifiability constraint, for instance,

(10.11)   $$\sigma_1 \le 1, \ldots, \sigma_{k-1} \le 1 .$$

The prior distribution can then be modified to

(10.12)   $$\pi(\mu, \tau) = \tau^{-1}, \qquad p, q_j \sim \mathcal{U}_{[0,1]}, \qquad \sigma_j \sim \mathcal{U}_{[0,1]}, \qquad \theta_j \sim \mathcal{N}(0, \zeta^2) .$$

The representation (10.10) allows the use of an improper distribution on $(\mu, \tau)$,
and (10.11) justifies the use of a uniform distribution on the $\sigma_j$'s (Robert
and Titterington 1998). The influence of the only hyperparameter, $\zeta$, on the
resulting estimations of the parameters is, moreover, moderate, if present.
(See Problems 9.26 and 10.39.) ||

^8 It reproduces the feature of the original parameterization, namely of independence
between the parameters conditional on the global location-scale parameter.



Example 10.27. Acidity level in lakes. In a study of acid neutralization
capacity in lakes, Crawford et al. (1992) used a Laplace approximation to the
normal mixture model

(10.13)   $$f(x) = \sum_{j=1}^{3} p_j\, \frac{1}{\sqrt{2\pi}\, w_j} \exp\left\{ -\frac{(x - \mu_j)^2}{2 w_j^2} \right\} .$$

This model can also be fit with a Gibbs sampling algorithm based on the 
reparameterization of Example 10.26 (see Problem 10.39), and Figures 10.7 
and 10.8 illustrate the performances of this algorithm on a benchmark ex- 
ample introduced in Crawford et al. (1992). Figure 10.7 shows the (lack of) 
evolution of the estimated density based on the averaged parameters from 
the Gibbs sampler when the number of iterations T increases. Figure 10.8 
gives the corresponding convergence of these averaged parameters and shows 
that the stability is less obvious at this level. This phenomenon occurs quite 
often in mixture settings and is to be blamed on the weak identifiability of 
these models, for which quite distinct sets of parameters lead to very similar 
densities, as shown by Figure 10.7. || 



In the examples above, the reparameterization of mixtures and the
corresponding correction of prior distributions result in a higher stability of Gibbs
sampler algorithms and, in particular, in the disappearance of the
phenomenon of trapping states.

The identifiability constraint used in (10.11) has an advantage over
equivalent constraints, such as $p_1 \ge p_2 \ge \cdots \ge p_k$ or $\mu_1 \le \mu_2 \le \cdots \le \mu_k$. It not
only automatically provides a compact support for the parameters but also
allows the use of a uniform prior distribution. However, it sometimes slows
down convergence of the Gibbs sampler.

Fig. 10.7. Evolution of the estimation of the density (10.13) for three components
and $T$ iterations of a Gibbs sampling algorithm. The estimations are based on 149
observations of acidity levels for lakes in the American Northeast, used in Crawford
et al. (1992) and represented by the histograms.

Although the new parameterization helps the Gibbs sampler to distinguish
between the homogeneous components of the sample (that is, to identify the
observations with the same indicator variable), this reparameterization can
encounter difficulties in moving from homogeneous subsamples. When the
order of the component indices does not correspond to the constraint (10.11),
the probability of simultaneously permuting all the observations of two
ill-ordered but homogeneous subsamples is very low. One solution in the normal
case is to identify the homogeneous components without imposing the
constraint (10.11). The uniform prior distribution $\mathcal{U}_{[0,1]}$ on $\sigma_j$ must be replaced
by $\sigma_j \sim \frac{1}{2}\,\mathcal{U}_{[0,1]} + \frac{1}{2}\,\mathcal{P}a(2, 1)$, which is equivalent to assuming that either $\sigma_j$ or
$\sigma_j^{-1}$ is distributed from a $\mathcal{U}_{[0,1]}$ distribution. The simulation steps for $\sigma_j$ are
modified, but there exists a direct Accept-Reject algorithm to simulate from
the conditional posterior distribution of the $\sigma_j$'s. An alternative solution is to
keep the identifiability constraints and to use a hybrid Markov chain Monte
Carlo algorithm, where, in every $U$ iterations, a random permutation of the
values of $z_i$ is generated via a Metropolis-Hastings scheme:



Algorithm A.45 -Hybrid Allocation Mixture Estimation-

0. Simulate $z_i$ $(i = 1, \ldots, n)$.

1. Generate a random permutation $\psi$ on $\{1, 2, \ldots, k\}$ and
   derive $\tilde z = (\psi(z_i))_i$ and $\tilde\xi = (\xi_{\psi(j)})_j$.
                                                                        [A.45]
2. Generate $\xi'$ conditionally on $(\tilde\xi, \tilde z)$ by a standard Gibbs
   sampler iteration.

3. Accept $(\xi', \tilde z)$ with probability

(10.14)

where $\pi(\xi, z)$ denotes the distribution of $(\xi, z)$.

Fig. 10.8. Convergence of estimators of the parameters of the mixture (10.13) for
the same iterations as in Figure 10.7.



i=l 







Wj 




< 0 . 



where $\pi_1(z|\xi)$ and $\pi_2(\xi'|z, \xi)$ are the conditional distributions used in the
Gibbs sampler and $z^-$ is the previous value of the allocation vector. Note
that, in (10.14), $\pi_1(\psi^{-1}(\tilde z)|\xi) = \pi_1(z|\xi)$.

This additional Metropolis-Hastings step may result in delicate
computations in terms of manipulation of indices but it is of the same order of
complexity as an iteration of the Gibbs sampler since the latter also requires
the computation of the sums $\sum_j p_j f(x_i|\xi_j)$ $(i = 1, \ldots, n)$. The value of
$U$ in [A.45] can be arbitrarily fixed (for instance, at 50) and later modified
depending on the average acceptance rate corresponding to (10.14). A high
acceptance rate means that the Gibbs sampler lacks a sufficient number of
iterations to produce a stable allocation to homogeneous classes; a low
acceptance rate suggests reduction of $U$ so as to more quickly explore the possible
permutations.

10.4.2 Rao-Blackwellization

Some of the Rao-Blackwellization results of Section 9.3 carry over to the
multi-stage Gibbs sampler. For example, Liu et al. (1995) are able to extend
Proposition 9.18 to the multi-stage Gibbs sampler in the case of the random
Gibbs sampler, where every step only updates a single component of $y$. In the
setup of [A.40], define a multinomial distribution $\sigma = (\sigma_1, \ldots, \sigma_p)$.



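Algorithm A.46 -Random Gibbs Sampler-

Given $y^{(t)}$,

1. Select a component $\nu \sim \sigma$.

2. Generate $y_\nu^{(t+1)} \sim g_\nu(y_\nu|y_j^{(t)}, j \neq \nu)$ and take

   $$y_j^{(t+1)} = y_j^{(t)} \quad \text{for } j \neq \nu .$$   [A.46]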

Note that, although [A.46] only generates one component of $y$ at each
iteration, the resulting chain is strongly irreducible (Definition 6.13) because of
the random choice of $\nu$. It also satisfies the following property.
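
A minimal sketch of [A.46] is given below for a target chosen purely for illustration, the bivariate normal distribution with unit variances and correlation `rho`, whose full conditionals are normal. The function name `random_gibbs` and its arguments are hypothetical.

```python
import numpy as np

def random_gibbs(n_iter, sigma, rho, rng):
    """Random Gibbs sampler for N_2(0, [[1, rho], [rho, 1]]).

    sigma : selection probabilities (sigma_1, sigma_2) of the component to update
    """
    y = np.zeros(2)
    chain = np.empty((n_iter, 2))
    sd = np.sqrt(1.0 - rho**2)
    for t in range(n_iter):
        nu = rng.choice(2, p=sigma)                # step 1: select nu ~ sigma
        y[nu] = rng.normal(rho * y[1 - nu], sd)    # step 2: y_nu ~ g_nu(. | y_j, j != nu)
        chain[t] = y                               # the other component is unchanged
    return chain
```

Since only one coordinate is refreshed per iteration, successive values of a given coordinate are often identical; the empirical autocovariances of such a chain can be compared with the positivity and monotonicity stated in Proposition 10.28.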

Proposition 10.28. The chain $(y^{(t)})$ generated by [A.46] has the property
that for every function $h \in \mathcal{L}_2(g)$, the covariance $\mathrm{cov}(h(y^{(0)}), h(y^{(t)}))$ is
positive and decreasing in $t$.

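Proof. Assume again $\mathbb{E}_g[h(Y)] = 0$; then

$$\mathbb{E}[h(Y^{(0)})h(Y^{(1)})] = \sum_{\nu=1}^{p} \sigma_\nu\, \mathbb{E}\big[\mathbb{E}[h(Y^{(0)})h(Y^{(1)}) \,|\, \nu, (Y_j^{(0)}, j \neq \nu)]\big]$$

$$= \sum_{\nu=1}^{p} \sigma_\nu\, \mathbb{E}\big[\mathbb{E}[h(Y^{(0)}) \,|\, (Y_j^{(0)}, j \neq \nu)]^2\big]$$

$$= \mathbb{E}\big[\mathbb{E}[h(Y) \,|\, \nu, (Y_j, j \neq \nu)]^2\big] ,$$

due to the reversibility of the chain and the independence between $Y^{(0)}$ and
$Y^{(1)}$, conditionally on $\nu$ and $(y_j^{(0)}, j \neq \nu)$. A simple recursion implies

$$\mathbb{E}[h(Y^{(0)})h(Y^{(t)})] = \mathrm{var}\big(\mathbb{E}[\cdots \mathbb{E}[\mathbb{E}[h(Y) \,|\, \nu, (Y_j, j \neq \nu)] \,|\, Y] \cdots]\big) ,$$

where the second term involves $t$ conditional expectations, successively in
$(\nu, (y_j, j \neq \nu))$ and in $Y$. □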

This proof suggests choosing a distribution $\sigma$ that more heavily weights
components with small $\mathbb{E}[h(Y)|(y_j, j \neq \nu)]^2$, so the chain will typically visit
components where $g_\nu(y_\nu|y_j, j \neq \nu)$ is not too variable. However, the
opposite choice seems more logical when considering the speed of convergence to
the stationary distribution, since components with high variability should be
visited more often in order to accelerate the exploration of the support of
$g$. This dichotomy between acceleration of the convergence to the
stationary distribution and reduction of the empirical variance is typical of MCMC
methods.

Another substantial benefit of Rao-Blackwellization is an elegant method
for the approximation of densities of different components of $y$. Since

$$\frac{1}{T} \sum_{t=1}^{T} g_i(y_i|y_j^{(t)}, j \neq i)$$

is unbiased and converges to the marginal density $g_i(y_i)$ if these conditional
densities are available in closed form, it is unnecessary (and inefficient) to use
nonparametric density estimation methods such as kernel methods (see Fan
and Gijbels 1996 or Wand and Jones 1995). This property can also be used
in extensions of the Riemann sum method (see Section 4.3) to setups where
the density $f$ needs to be approximated (see Problem 10.17).
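
The averaging of conditional densities can be sketched as follows. The target is an assumption made for the illustration (a bivariate normal with unit variances and correlation `rho`, so that the true marginal is standard normal and the estimate can be checked); all function names are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x (broadcasts over arrays)."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def gibbs_chain(rho, n_iter, rng):
    """Two-stage Gibbs sampler for N_2(0, [[1, rho], [rho, 1]])."""
    chain = np.empty((n_iter, 2))
    y1, y2 = 0.0, 0.0
    sd = np.sqrt(1.0 - rho**2)
    for t in range(n_iter):
        y1 = rng.normal(rho * y2, sd)
        y2 = rng.normal(rho * y1, sd)
        chain[t] = y1, y2
    return chain

def rao_blackwell_marginal(grid, y2_chain, rho):
    """Average of the conditional densities g_1(y1 | y2^(t)) over the chain."""
    sd = np.sqrt(1.0 - rho**2)
    # Rows index grid points, columns index iterations.
    dens = normal_pdf(grid[:, None], rho * y2_chain[None, :], sd)
    return dens.mean(axis=1)
```

Evaluated on a grid, `rao_blackwell_marginal` returns a smooth curve without any bandwidth choice, which is the practical advantage over a kernel estimate built from the simulated values of the first component.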

Another consequence of Proposition 9.18 is a justification of the technique
of batch sampling proposed in some MCMC algorithms (Geyer 1992, Raftery
and Lewis 1992, Diebolt and Robert 1994). Batch sampling involves
subsampling the sequence $(Y^{(t)})_t$ produced by a Gibbs sampling method into $(Y^{(ks)})_s$
$(k > 1)$, in order to decrease the dependence between the points of the
sample. However, Lemma 12.2 describes a negative impact of subsampling on the
variance of the corresponding estimators.
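
The effect of subsampling on the variance can be explored with a small Monte Carlo experiment. The sketch below does not use a Gibbs sampler: as a stand-in, it assumes a stationary AR(1) sequence with coefficient `phi`, and compares the variance of the empirical mean based on every iteration with that based on every `k`-th iteration. The function name and default values are hypothetical.

```python
import numpy as np

def compare_subsampling(n=2000, k=50, phi=0.9, n_rep=300, seed=0):
    """Variance of the empirical mean: full chain versus every k-th value.

    Each row of x is an independent stationary AR(1) sequence
    X_{t+1} = phi X_t + eps_t, used here as a stand-in for MCMC output.
    """
    rng = np.random.default_rng(seed)
    x = np.empty((n_rep, n))
    x[:, 0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - phi**2), n_rep)
    eps = rng.normal(size=(n_rep, n))
    for t in range(1, n):
        x[:, t] = phi * x[:, t - 1] + eps[:, t]
    full = x.mean(axis=1)            # estimator using every iteration
    thinned = x[:, ::k].mean(axis=1) # estimator using every k-th iteration
    return float(full.var()), float(thinned.var())
```

The subsampled points are less dependent, but there are `k` times fewer of them; in this experiment the second variance is the larger one, in line with the direction of the warning above.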

10.4.3 Improper Priors 

This section discusses a particular danger resulting from careless use of
Metropolis-Hastings algorithms, in particular the Gibbs sampler. We know that the
Gibbs sampler is based on conditional distributions derived from $f(x_1, \ldots, x_q)$
or $g(y_1, \ldots, y_p)$. What is particularly insidious is that these conditional
distributions may be well defined and may be simulated from, but may not
correspond to any joint distribution $g$; that is, the function $g$ given by Lemma 10.5
is not integrable. The same problem may occur when using a proportionality
relation such as $\pi(\theta|x) \propto \pi(\theta)f(x|\theta)$ to derive a Metropolis-Hastings algorithm for
$\pi(\theta)f(x|\theta)$.

This problem is not a defect of the Gibbs sampler, nor even a simulation^9
problem, but rather a problem of carelessly using the Gibbs sampler in a
situation for which the underlying assumptions are violated. It is nonetheless
important to warn the user of MCMC algorithms against this danger, because
it corresponds to a situation often encountered in Bayesian noninformative (or
“default”) modelings.

The construction of the Gibbs sampler directly from the conditional
distributions is a strong incentive to bypass checking for the propriety of $g$,
especially in complex setups. In particular, the function $g$ may well be
undefined in a “generalized” Bayesian approach where the prior distribution is
“improper,” in the sense that it is a $\sigma$-finite measure instead of being a
regular probability measure (see Berger 1985, Section 3.3, or Robert 2001, Section
1.5).

^9 The “distribution” $g$ does not exist in this case.

Example 10.29. Conditional exponential distributions. The following
model was used by Casella and George (1992) to point out the difficulty of
assessing the impropriety of a posterior distribution through the conditional
distributions. The pair of conditional densities

$$X_1|x_2 \sim \mathcal{E}xp(x_2) , \qquad X_2|x_1 \sim \mathcal{E}xp(x_1)$$

are well defined and are functionally compatible in the sense that the ratio

$$\frac{f(x_1|x_2)}{f(x_2|x_1)} = \frac{x_2 \exp(-x_2 x_1)}{x_1 \exp(-x_1 x_2)} = \frac{x_2}{x_1}$$

separates into a function of $x_1$ and a function of $x_2$. By virtue of Bayes'
theorem, this is a necessary condition for the existence of a joint density $f(x_1, x_2)$
corresponding to these conditional densities. However, recall that Theorem
9.3 requires that

$$\int \frac{f(x_1|x_2)}{f(x_2|x_1)}\, dx_1 < \infty ,$$

but the function $x_1^{-1}$ cannot be integrated on $\mathbb{R}_+$. Moreover, if the joint
density, $f(x_1, x_2)$, existed, it would have to satisfy $f(x_1|x_2) = f(x_1, x_2) / \int f(x_1, x_2)\, dx_1$.
The only function $f(x_1, x_2)$ satisfying this equation is

$$f(x_1, x_2) \propto \exp(-x_1 x_2) .$$

Thus, these conditional distributions do not correspond to a joint probability
distribution (see Problem 10.24). ||
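
The lack of integrability of $\exp(-x_1 x_2)$ on the positive quadrant can be seen numerically by computing its mass on growing squares $[0, M]^2$. In the sketch below, the inner integral over $x_1$ is taken in closed form, $(1 - e^{-M x_2})/x_2$, and the outer integral is approximated by a trapezoidal rule after the substitution $x_2 = e^u$; the lower cutoff $u = -40$, the grid size, and the function name are assumptions of this illustration.

```python
import numpy as np

def mass_on_square(M, n_grid=200001):
    """Approximate integral of exp(-x1 * x2) over the square [0, M]^2."""
    # With x2 = exp(u), dx2 / x2 = du, so the outer integrand is 1 - exp(-M e^u).
    u = np.linspace(-40.0, np.log(M), n_grid)
    integrand = -np.expm1(-M * np.exp(u))
    du = u[1] - u[0]
    # Trapezoidal rule written out explicitly.
    return float(du * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])))
```

The mass grows like $\log(M^2)$ plus a constant, so multiplying $M$ by 10 adds about $2\log 10 \approx 4.6$ each time and no normalizing constant can exist.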



Example 10.30. Normal improper posterior. Consider $X \sim \mathcal{N}(\theta, \sigma^2)$.
A Bayesian approach when there is little prior information on the parameters
$\theta$ and $\sigma$ is that of Jeffreys (1961). This approach is based on the derivation of the prior
distribution of $(\theta, \sigma)$ from the Fisher information of the model as $I(\theta, \sigma)^{1/2}$,
the square root being justified by invariance reasons (as well as information
theory considerations; see Lehmann and Casella 1998 or Robert 2001, Section
3.5.3). Therefore, in this particular case, $\pi(\theta, \sigma) = \sigma^{-2}$ and the posterior
distribution follows by a (formal) application of Bayes' Theorem,

$$\pi(\theta, \sigma|x) = \frac{f(x|\theta, \sigma)\, \pi(\theta, \sigma)}{\int f(x|\zeta)\, \pi(\zeta)\, d\zeta} .$$

It is straightforward to check that

$$\int f(x|\zeta)\, \pi(\zeta)\, d\zeta = \int_0^\infty \frac{d\sigma}{\sigma^2} = \infty ,$$

and, hence, $\pi(\theta, \sigma|x)$ is not defined.

However, this impropriety may never be detected as it is common Bayesian
practice to consider only the proportionality relation between $\pi(\theta, \sigma|x)$ and
$f(x|\theta, \sigma)\, \pi(\theta, \sigma)$. Similarly, a derivation of a Gibbs sampler for this model is
also based on the proportionality relation since the conditional distributions
$\pi(\theta|\sigma, x)$ and $\pi(\sigma|\theta, x)$ can be obtained directly as

$$\pi(\theta|\sigma, x) \propto e^{-(\theta - x)^2/2\sigma^2} , \qquad \pi(\sigma|\theta, x) \propto \sigma^{-3} e^{-(\theta - x)^2/2\sigma^2} ,$$

which lead to the full conditional distributions $\mathcal{N}(x, \sigma^2)$ and $\mathcal{E}xp((\theta - x)^2/2)$
for $\theta$ and $\sigma^{-2}$, respectively.

This practice of omitting the computation of normalizing constants is
justified as long as the joint distribution does exist and is particularly useful
because these constants often cannot be obtained in closed form. However,
the use of improper prior distributions implies that the derivation of
conditional distributions from the proportionality relation is not always justified.
In the present case, the Gibbs sampler associated with these two conditional
distributions is

1. Simulate $\theta^{(t+1)} \sim \mathcal{N}(x, \sigma^{(t)2})$;

2. Simulate $s \sim \mathcal{E}xp((\theta^{(t+1)} - x)^2/2)$ and take
   $\sigma^{(t+1)} = 1/\sqrt{s}$.

For an observation $x = 0$, the behavior of the chains $(\theta^{(t)})$ and $(\sigma^{(t)})$ produced
by this algorithm is shown in Figure 10.9, which exhibits an extremely fast
divergence^10 to infinity (both graphs are plotted in log scales). ||

Fig. 10.9. First iterations of the chains (a) $\log(|\theta^{(t)}|)$ and (b) $\log(\sigma^{(t)})$ for a
divergent Gibbs sampling algorithm.
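
The two-step sampler of Example 10.30 is short enough to be run directly. The sketch below assumes the full conditionals $\mathcal{N}(x, \sigma^2)$ for $\theta$ and $\mathcal{E}xp((\theta - x)^2/2)$ for $\sigma^{-2}$; the starting value $\sigma^{(0)} = 1$, the number of iterations, and the function name are arbitrary choices made for the illustration.

```python
import numpy as np

def improper_gibbs(x=0.0, n_iter=200, seed=1):
    """Gibbs sampler built from the conditionals of an improper posterior."""
    rng = np.random.default_rng(seed)
    theta = np.empty(n_iter)
    sigma = np.empty(n_iter)
    sig = 1.0                                   # arbitrary starting value (assumption)
    for t in range(n_iter):
        th = rng.normal(x, sig)                 # theta | sigma, x ~ N(x, sigma^2)
        rate = (th - x) ** 2 / 2.0
        s = rng.exponential(scale=1.0 / rate)   # sigma^-2 | theta, x ~ Exp((theta - x)^2 / 2)
        sig = 1.0 / np.sqrt(s)
        theta[t], sigma[t] = th, sig
    return theta, sigma
```

In this sketch each iteration multiplies $\sigma$ by a random factor that does not depend on its current value, so $\log \sigma^{(t)}$ behaves as a random walk with a nonzero drift and never stabilizes; plotting `np.log(sigma)` and `np.log(np.abs(theta))` against the iteration index shows trajectories that move off without returning.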



The setup of improper posterior distributions and of the resulting
behavior of the Gibbs sampler can be evaluated from a theoretical point of view,
as in Robert and Casella (1996, 1998). For example, the associated Markov
chain cannot be positive recurrent since the $\sigma$-finite measure associated with
the conditional distributions given by Lemma 10.5 is invariant. However, the
major task in such settings is to come up with indicators to flag that
something is wrong with the stationary measure, so that the experimenter can go
back to his/her prior distribution and check for propriety.

Given the results of Example 10.30, it may appear that a simple graph- 
ical monitoring is enough to exhibit deviant behavior of the Gibbs sampler. 
However, this is not the case in general and there are many examples, some of 
which are published (see Casella 1996), where the output of the Gibbs sampler 
seemingly does not differ from a positive recurrent Markov chain. Often, this 
takes place when the divergence of the posterior density occurs “at 0,” that 
is, at a specific point whose immediate neighborhood is rarely visited by the 
chain. Robert and Casella (1996) illustrate this seemingly acceptable behavior 
on an example initially treated in Gelfand et al. (1990). 

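Example 10.31. Improper random effects posterior. Consider a
random effects model,

$$Y_{ij} = \beta + U_i + \varepsilon_{ij} , \qquad i = 1, \ldots, I, \quad j = 1, \ldots, J,$$

where $U_i \sim \mathcal{N}(0, \sigma^2)$ and $\varepsilon_{ij} \sim \mathcal{N}(0, \tau^2)$. The Jeffreys (improper) prior for
the parameters $\beta$, $\sigma$, and $\tau$ is

$$\pi(\beta, \sigma^2, \tau^2) = \frac{1}{\sigma^2 \tau^2} .$$

The conditional distributions

$$U_i \,|\, y, \beta, \sigma^2, \tau^2 \sim \mathcal{N}\left( \frac{J(\bar y_i - \beta)}{J + \tau^2\sigma^{-2}} ,\ (J\tau^{-2} + \sigma^{-2})^{-1} \right) ,$$

$$\beta \,|\, u, y, \sigma^2, \tau^2 \sim \mathcal{N}(\bar y - \bar u ,\ \tau^2/JI) ,$$

$$\sigma^2 \,|\, u, \beta, y, \tau^2 \sim \mathcal{IG}\Big( I/2 ,\ (1/2) \sum_i u_i^2 \Big) ,$$

$$\tau^2 \,|\, u, \beta, y, \sigma^2 \sim \mathcal{IG}\Big( IJ/2 ,\ (1/2) \sum_{i,j} (y_{ij} - u_i - \beta)^2 \Big) ,$$

are well defined and a Gibbs sampling can be easily implemented in this
setting. Figure 10.10 provides the sequence of the $\beta^{(t)}$ produced by [A.40] and
the corresponding histogram for 1000 iterations. The trend of the sequence
and the histogram do not indicate that the corresponding “joint distribution”
does not exist (Problem 10.25). ||

Fig. 10.10. Sequence $(\beta^{(t)})$ and corresponding histogram for the random effects
model. The measurement scale for the $\beta^{(t)}$'s is on the right and the scale of the
histogram is on the left.

^10 The impropriety of this posterior distribution is hardly surprising from a
statistical point of view since this modeling means trying to estimate both parameters
of the distribution $\mathcal{N}(\theta, \sigma^2)$ from a single observation without prior information.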
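
A Gibbs sampler for the random effects model of Example 10.31 can be sketched as follows. The sketch assumes the prior $\pi(\beta, \sigma^2, \tau^2) = 1/(\sigma^2\tau^2)$ and the full conditionals that follow from it, which are spelled out in the comments; the function name, starting values, and the simulated data in the usage note are hypothetical.

```python
import numpy as np

def random_effects_gibbs(y, n_iter=1000, seed=0):
    """Gibbs sampler for Y_ij = beta + U_i + eps_ij; returns the beta sequence.

    y is an (I, J) array. The conditionals below are derived under the
    assumed improper prior 1 / (sigma^2 tau^2).
    """
    rng = np.random.default_rng(seed)
    I, J = y.shape
    ybar_i = y.mean(axis=1)
    beta, sigma2, tau2 = y.mean(), 1.0, 1.0      # arbitrary starting values
    betas = np.empty(n_iter)
    for t in range(n_iter):
        # U_i | ... ~ N(J (ybar_i - beta) / (J + tau2 / sigma2), (J / tau2 + 1 / sigma2)^-1)
        var_u = 1.0 / (J / tau2 + 1.0 / sigma2)
        u = rng.normal(J * (ybar_i - beta) / (J + tau2 / sigma2), np.sqrt(var_u))
        # beta | ... ~ N(ybar - ubar, tau2 / (I J))
        beta = rng.normal(y.mean() - u.mean(), np.sqrt(tau2 / (I * J)))
        # sigma2 | ... ~ IG(I / 2, sum(u^2) / 2): reciprocal of a gamma draw
        sigma2 = 1.0 / rng.gamma(I / 2.0, 2.0 / np.sum(u**2))
        # tau2 | ... ~ IG(I J / 2, sum((y_ij - u_i - beta)^2) / 2)
        resid = y - u[:, None] - beta
        tau2 = 1.0 / rng.gamma(I * J / 2.0, 2.0 / np.sum(resid**2))
        betas[t] = beta
    return betas
```

Run on, say, a simulated $6 \times 5$ data array, the returned sequence typically looks like the output of a well-behaved sampler, which is precisely the difficulty: nothing in such a trace signals that the underlying "posterior" is not integrable.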

Under some regularity conditions on the transition kernel, Robert and
Casella (1996) have shown that if there exist a positive function $b$, $\epsilon > 0$ and
a compact set $C$ such that $b(x) < \epsilon$ for $x \in C^c$, the chain $(y^{(t)})$ satisfies

(10.15)   $$\liminf_{t \to +\infty} \frac{1}{t} \sum_{s=1}^{t} b(y^{(s)}) = 0 .$$

A drawback of the condition (10.15) is the derivation of the function $b$, whereas
the monitoring of a liminf is delicate in a simulation experiment. Other
references on the analysis of improper Gibbs samplers can be found in Besag et al.
(1995) and in the subsequent discussion (see, in particular, Roberts and Sahu
1997), as well as in Natarajan and McCulloch (1998).



10.5 Problems 



10.1 Propose a direct Accept-Reject algorithm for the conditional distribution of $Y_2$
given $y_1$ in Example 10.1. Compare with the Gibbs sampling implementation
in a Monte Carlo experiment. (Hint: Run parallel chains and compute 90%
confidence intervals for both methods.)

10.2 Devise and implement a simulation algorithm for the Ising model of Example
10.2.

10.3 In the setup of Example 10.17:

(a) Evaluate the numerical value of the boundary constant $\epsilon$ derived from
(10.5), given the data in Table 10.1.

(b) Establish the lower bound in (10.5). (Hint: Replace by 0 in the
integrand [but not in the exponent] and do the integration.)

10.4 For the data of Table 10.1

(a) Estimate $\lambda_1, \ldots, \lambda_{10}$ using the model of Example 10.17.

(b) Estimate $a$ and $b$ in the loglinear model $\log \lambda = a + bt$ using a Gibbs sampler
and the techniques of Example 2.26.

(c) Compare the results of the analyses.

10.5 (Besag 1994) Consider a distribution $f$ on a finite state-space $\mathcal{X}$ of dimension
$k$ and conditional distributions $f_1, \ldots, f_k$ such that for every $(x, y) \in \mathcal{X}^2$, there
exist an integer $m$ and a sequence $x_0 = x, \ldots, x_m = y$ where $x_i$ and $x_{i+1}$ only
differ in a single component and $f(x_i) > 0$.

(a) Show that this condition extends Hammersley-Clifford Theorem (Theorem
10.5) by deriving $f(y)$ as

(b) Deduce that irreducibility and ergodicity hold for the associated Gibbs
sampler.

(c) Show that the same condition on a continuous state-space is not sufficient
to ensure ergodicity of the Gibbs sampler. (Hint: See Example 10.7.)

10.6 Compare the usual demarginalization of the Student's $t$ distribution discussed
in Example 10.4 with an alternative using a slice sampler, by computing the
empirical cdf at several points of interest for both approaches and the same
number of simulations. (Hint: Use two uniform dummy variables.)

10.7 A bound similar to the one in Example 10.17, established in Problem 10.3, 
can be obtained for the kernel of Example 10.4; that is, show that 



yip 



+ ^ exp U 1 + (^ - 



> [1 + {e’ - 0o)"] 
{Hint Establish that 



27t V l-trj 

expj-^ (1 + (6» - 0o)^)} drt 






exp 



9oT] \ 1 -f ry ^ ^ 

6 — — ^ > > exp 



- 26 ' Oo + el)^ 



and integrate.) 

10.8 (Robert et al. 1997) Consider a distribution $f$ on $\mathbb{R}^k$ to be simulated by Gibbs
sampling. A one-to-one transform $\Psi$ on $\mathbb{R}^k$ is called a parameterization. The
convergence of the Gibbs sampler can be jeopardized by a bad choice of
parameterization.

(a) Considering that a Gibbs sampler can be formally associated with the full
conditional distributions for every choice of a parameterization, show that
there always exist parameterizations such that the Gibbs sampler fails to
converge.

(b) The fact that the convergence depends on the parameterization vanishes if
(i) the support of $f$ is arcwise connected, that is, for every $(x, y) \in (\mathrm{supp}(f))^2$,
there exists a continuous function $\varphi$ on $[0, 1]$ with $\varphi(0) = x$, $\varphi(1) = y$,
and $\varphi([0, 1]) \subset \mathrm{supp}(f)$ and (ii) the parameterizations are restricted to be
continuous functions.

(c) Show that condition (i) in (b) is necessary for the above property to hold.
(Hint: See Example 10.7.)

(d) Show that a rotation of $\pi/4$ of the coordinate axes eliminates the
irreducibility problem in Example 10.7.

10.9 In the setup of Example 5.13, show that the Gibbs sampling simulation of the
ordered normal means $\theta_{ij}$ can either be done in $I \times J$ conditional steps or in only
two conditional steps. Conduct an experiment to compare both approaches.

10.10 The clinical mastitis data described in Example 10.15 is the number of cases
observed in 127 herds of dairy cattle (the herds are adjusted for size). The data
are given in Table 10.3.

Table 10.3. Occurrences of clinical mastitis in 127 herds of dairy cattle. (Source:
Schukken et al. 1991.)

(a) Verify the conditional distributions for $\lambda_i$ and $\beta_i$.

(b) For these data, implement the Gibbs sampler and plot the posterior density
$\pi(\lambda_1|x, \alpha)$. (Use the values $\alpha = .1$, $a = b = 1$.)

(c) Make histograms and monitor the convergence of $\lambda_5$, $\lambda_{15}$, and $\beta_{15}$.

(d) Investigate the sensitivity of your answer to the specification of $\alpha$, $a$ and $b$.

10.11 (a) For the Gibbs sampler of [A.40], if the Markov chain satisfies the
positivity condition (see Definition 9.4), show that the conditional
distributions $g_i(y_i|y_1, y_2, \ldots, y_{i-1}, y_{i+1}, \ldots, y_p)$ will not reduce the range of possible
values of $Y_i$ when compared with $g$.

(b) Prove Theorem 10.8.

10.12 Show that the Gibbs sampler of Algorithm [A.41] is reversible. (Consider a
generalization of the proof of Algorithm [A.36].)

10.13 For the hybrid algorithm of [A.43], show that the resulting chain is a Markov
chain and verify its stationary distribution. Is the chain reversible?

10.14 Show that the model (10.8) is not identifiable in $(\beta, \sigma)$. (Hint: Show that it
only depends on the ratio $\beta/\sigma$.)

10.15 If $K(x, x')$ is a Markov transition kernel with stationary distribution $g$, show
that the Metropolis-Hastings algorithm where, at iteration $t$, the proposed value
$y_t \sim K(x^{(t)}, y)$ is accepted with probability

$$\frac{f(y_t)\, g(x^{(t)})}{f(x^{(t)})\, g(y_t)} \wedge 1 ,$$

provides a valid MCMC algorithm for the stationary distribution $f$. (Hint: Use
detailed balance.)

10.16 For the algorithm of Example 7.13, an obvious candidate to simulate $\pi(a|x, y, b)$
and $\pi(b|x, y, a)$ is the ARS algorithm of Section 2.4.2.

(a) Show that 



log7r(o|x,y,6) = - ^2/i log (^1 + 



a-\-bxi ^2 



2ct2 ’ 



and, thus, 

^ log 7r(a|x, y , 6) = j/i - 2/i - 

i i 

and 



a+bxi 






-E 



a-{-bxi 



(1 + ga + 6x^^2 fj2 



a" 

aa2 



log7r(a|x,y,6) = - y^ 



Vie 



a+bxi 



(1 + 



■E 



CL-\-bxj^ 2(o,-t-6x^) 

e c- _2 



(1 + 



= -E 



a+bxi 



(1 + 






Argue that this last expression is not always negative, so the ARS algorithm 
cannot be applied. 

(b) Show that 



db^ 



log7r(6|x,y,a) 

^ (X I bcc ^ _ 

= “ E n +e°+i>"i)3 ~ } xl- T~^ 

i ^ ^ 



and deduce there is also no log-concavity in the b direction. 

(c) Even though distributions are not log-concave, they can be simulated with 
the ARMS algorithm [A. 28]. Give details on how to do this. 

10.17 Show that the Riemann sum method of Section 4.3 can be used in conjunction with Rao-Blackwellization to cover multidimensional settings. (Note: See Philippe and Robert 1998a for details.)

10.18 The tobit model is used in econometrics (see Tobin 1958). It is based on a transform of a normal variable $y_i^* \sim \mathcal{N}(x_i^t\beta, \sigma^2)$ $(i = 1, \ldots, n)$ by truncation

$$y_i = \max(y_i^*, 0).$$

Show that the following algorithm provides a valid approximation of the posterior distribution of $(\beta, \sigma^2)$.

1. Simulate $y_i^* \sim \mathcal{N}_-(x_i^t\beta, \sigma^2, 0)$ if $y_i = 0$,

2. Simulate $(\beta, \sigma) \sim \pi(\beta, \sigma \mid y^*, x)$ with

$$\pi(\beta, \sigma \mid y^*, x) \propto \sigma^{-n} \exp\Big\{-\sum_i (y_i^* - x_i^t\beta)^2 / 2\sigma^2\Big\}\, \pi(\beta, \sigma).$$

10.19 Let $(X_n)$ be a reversible stationary Markov chain (see Definition 6.44) with $E[X_i] = 0$.

(a) Show that the distribution of $X_k | X_1$ is the same as the distribution of $X_1 | X_k$.

(b) Show that the covariance between alternate random variables is positive. More precisely, show that $\mathrm{cov}(X_0, X_{2\nu}) = E[E(X_0|X_\nu)^2] > 0$.

(c) Show that the covariance of alternate random variables is decreasing; that is, show that $\mathrm{cov}(X_0, X_{2\nu}) \ge \mathrm{cov}(X_0, X_{2(\nu+1)})$. (Hint: Use the fact that

$$E[E(X_0|X_\nu)^2] = \mathrm{var}[E(X_0|X_\nu)] \ge \mathrm{var}[E\{E(X_0|X_\nu)|X_{\nu+1}\}],$$

and show that this latter quantity is $\mathrm{cov}(X_0, X_{2(\nu+1)})$.)
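
The two properties in parts (b) and (c) can be checked numerically on a small example before attempting the proof. The sketch below assumes a finite-state chain built from a symmetric weight matrix `W` (a standard way to obtain a reversible chain); the matrix, the state values, and the function names are illustrative choices, not taken from the text.

```python
def mat_vec(P, v):
    # Matrix-vector product for a list-of-lists matrix.
    return [sum(P[i][j] * v[j] for j in range(len(v))) for i in range(len(P))]


def even_lag_covariances(W, values, max_nu):
    """cov(X_0, X_{2 nu}) for nu = 1..max_nu, for the reversible chain with
    transition matrix P[i][j] = W[i][j] / sum_j W[i][j] (W symmetric)."""
    n = len(W)
    row = [sum(W[i]) for i in range(n)]
    total = sum(row)
    pi = [r / total for r in row]              # stationary distribution
    P = [[W[i][j] / row[i] for j in range(n)] for i in range(n)]
    mean = sum(pi[i] * values[i] for i in range(n))
    x = [v - mean for v in values]             # center so that E[X] = 0
    covs = []
    h = x[:]
    for _ in range(max_nu):
        h = mat_vec(P, mat_vec(P, h))          # h = P^{2 nu} x
        covs.append(sum(pi[i] * x[i] * h[i] for i in range(n)))
    return covs


W = [[1.0, 2.0, 1.0], [2.0, 1.0, 3.0], [1.0, 3.0, 2.0]]
covs = even_lag_covariances(W, [-1.0, 0.0, 2.0], 6)
```

The list `covs` should be nonnegative and nonincreasing, in agreement with the statement of the problem.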

10.20 Show that a Gibbs sampling kernel with more than two full conditional steps 
cannot produce interleaving chains. 

10.21 In the setup of Example 10.2:

(a) Consider a two-dimensional grid with the simple nearest-neighbor relation $\mathcal{N}$ (that is, $(i, j) \in \mathcal{N}$ if and only if $\min(|i_1 - j_1|, |i_2 - j_2|) = 1$). Show that a Gibbs sampling algorithm with only two conditional steps can be implemented in this case.

(b) Implement the Metropolization scheme of Liu (1995) discussed in Section 
10.3.3 and compare the results with those of part (a). 

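10.22 (Gelfand et al. 1995) In the setup of Example 10.25, show that the original parameterization leads to the following correlations:

$$\rho_{\mu,\alpha_i} = -\Big(1 + \frac{I\sigma_y^2}{J\sigma_\alpha^2}\Big)^{-1/2}, \qquad \rho_{\alpha_i,\alpha_j} = \Big(1 + \frac{I\sigma_y^2}{J\sigma_\alpha^2}\Big)^{-1},$$

for $i \ne j$.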

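10.23 (Continuation of Problem 10.22) For the hierarchical parameterization, show that the correlations are

$$\rho_{\mu,\eta_i} = \Big(1 + \frac{IJ\sigma_\alpha^2}{\sigma_y^2}\Big)^{-1/2}, \qquad \rho_{\eta_i,\eta_j} = \Big(1 + \frac{IJ\sigma_\alpha^2}{\sigma_y^2}\Big)^{-1},$$

for $i \ne j$.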

10.24 In Example 10.29, we noted that a pair of conditional densities is functionally compatible if the ratio $f_1(x|y)/f_2(y|x) = h_1(x)/h_2(y)$, for some functions $h_1$ and $h_2$. This is a necessary condition for a joint density to exist, but not a sufficient condition. If such a joint density does exist, the pair $f_1$ and $f_2$ would be compatible.

(a) Formulate the definitions of compatible and functionally compatible for a set of densities $f_1, \ldots, f_m$.

(b) Show that if $f_1, \ldots, f_m$ are the full conditionals from a hierarchical model, they are functionally compatible.

(c) Prove the following theorem, due to Hobert and Casella (1998), which shows there cannot be any stationary probability distribution for the chain to converge to unless the densities are compatible.

Theorem 10.32. Let $f_1, \ldots, f_m$ be a set of functionally compatible conditional densities on which a Gibbs sampler is based. The resulting Markov chain is positive recurrent if and only if $f_1, \ldots, f_m$ are compatible.

(Note: Compatibility of a set of densities was investigated by Besag 1974, Arnold and Press 1989, and Gelman and Speed 1993.)

10.25 For the situation of Example 10.31, 

(a) Show that the full "posterior distribution" of $(\beta, \sigma^2, \tau^2)$ is



7t(/ 5, r^| 2 /) oc cr 2 / ^ 2 jj 



X exp 



1 2(r2 + J<t 2) j 






(Hint: Integrate the (unobservable) random effects $u_i$.)

(b) Integrate out $\beta$ and show that the marginal posterior density of $(\sigma^2, \tau^2 | y)$ is



X exp 



I 2 t 2 ^ y^^ 2(r2 + J(t2) ^ I 



(c) Show that the full posterior is not integrable since, for $\tau \ne 0$, $\pi(\sigma^2, \tau^2|y)$ behaves like $\sigma^{-2}$ in a neighborhood of 0.

10.26 (Chib 1995) Consider the approximation of the marginal density $m(x) = f(x|\theta)\pi(\theta)/\pi(\theta|x)$, where $\theta = (\theta_1, \ldots, \theta_B)$ and the full conditionals $\pi(\theta_r|x, \theta_s, s \ne r)$ are available.

(a) In the case $B = 2$, show that an appropriate estimate of the marginal loglikelihood is

$$\hat\ell(x) = \log f(x|\theta_1^*) + \log \pi_1(\theta_1^*) - \log\Big\{\frac{1}{T}\sum_{t=1}^T \pi(\theta_1^*|x, \theta_2^{(t)})\Big\},$$

where $\theta_1^*$ is an arbitrary point, assuming that $f(x|\theta) = f(x|\theta_1)$ and that $\pi_1(\theta_1)$ is available.

(b) In the general case, rewrite the posterior density as

$$\pi(\theta|x) = \pi_1(\theta_1|x)\,\pi_2(\theta_2|x, \theta_1) \cdots \pi_B(\theta_B|x, \theta_1, \ldots, \theta_{B-1}).$$

Show that an estimate of $\pi_r(\theta_r^*|x, \theta_s^*, s < r)$ is

$$\hat\pi_r(\theta_r^*|x, \theta_s^*, s < r) = \frac{1}{T}\sum_{t=1}^T \pi(\theta_r^*|x, \theta_1^*, \ldots, \theta_{r-1}^*, \theta_{r+1}^{(t)}, \ldots, \theta_B^{(t)}),$$

where the $\theta_\ell^{(t)}$ $(\ell > r)$ are simulated from the full conditionals $\pi_\ell(\theta_\ell|\theta_1^*, \ldots, \theta_r^*, \theta_{r+1}, \ldots, \theta_B)$.

(c) Deduce that an estimate of the joint posterior density is $\prod_{r=1}^B \hat\pi_r(\theta_r^*|x, \theta_s^*, s < r)$ and that an estimate of the marginal loglikelihood is

$$\hat\ell(x) = \log f(x|\theta^*) + \log \pi(\theta^*) - \sum_{r=1}^B \log \hat\pi_r(\theta_r^*|x, \theta_s^*, s < r)$$

for an arbitrary value $\theta^*$.

(d) Discuss the computational cost of this method as a function of $B$.

(e) Extend the method to the approximation of the predictive density

$$f(y|x) = \int f(y|\theta)\,\pi(\theta|x)\,d\theta.$$
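
To make the $B = 2$ case of part (a) concrete, here is a minimal sketch on a toy model chosen only because its marginal is known in closed form. The model is an assumption of this illustration, not part of the problem: $x|\theta_1 \sim \mathcal{N}(\theta_1, 1)$, $\theta_1|\theta_2 \sim \mathcal{N}(\theta_2, 1)$, $\theta_2 \sim \mathcal{N}(0, 1)$, so that $\pi_1(\theta_1)$ is $\mathcal{N}(0, 2)$, $m(x)$ is $\mathcal{N}(0, 3)$, and the full conditionals are $\theta_1|x, \theta_2 \sim \mathcal{N}((x + \theta_2)/2, 1/2)$ and $\theta_2|\theta_1 \sim \mathcal{N}(\theta_1/2, 1/2)$.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def chib_log_marginal(x, theta1_star, n_iter=20000, burn_in=500, seed=0):
    """Estimate log m(x) as log f(x|t1*) + log pi_1(t1*) - log pi_hat(t1*|x),
    where pi_hat averages the full conditional of theta_1 over Gibbs draws
    of theta_2 (toy Gaussian hierarchy, see the assumptions above)."""
    rng = random.Random(seed)
    theta1 = x
    dens = []
    for t in range(n_iter + burn_in):
        # Two-step Gibbs sampler on (theta_1, theta_2) given x.
        theta2 = rng.gauss(theta1 / 2.0, math.sqrt(0.5))
        theta1 = rng.gauss((x + theta2) / 2.0, math.sqrt(0.5))
        if t >= burn_in:
            dens.append(math.exp(
                log_normal_pdf(theta1_star, (x + theta2) / 2.0, 0.5)))
    log_post = math.log(sum(dens) / len(dens))
    return (log_normal_pdf(x, theta1_star, 1.0)      # log f(x | theta_1*)
            + log_normal_pdf(theta1_star, 0.0, 2.0)  # log pi_1(theta_1*)
            - log_post)


estimate = chib_log_marginal(1.0, 0.5)
exact = log_normal_pdf(1.0, 0.0, 3.0)
```

In this toy setting `estimate` can be compared directly with `exact`; in realistic models no such benchmark is available and the choice of $\theta_1^*$ (typically a high-density point) affects the variance of the estimate.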

10.27 (Fishman 1996) Given a contingency table with cell sizes $N_{ij}$, row sums $N_{i\cdot}$, and column sums $N_{\cdot j}$, the chi-squared statistic for independence is

$$\chi^2 = \sum_{(i,j)} \frac{(N_{ij} - N_{i\cdot}N_{\cdot j}/N)^2}{N_{i\cdot}N_{\cdot j}/N}.$$

Assuming fixed margins $N_{i\cdot}$ and $N_{\cdot j}$ $(i = 1, \ldots, I, j = 1, \ldots, J)$, the goal is to simulate the distribution of $\chi^2$. Design a Gibbs sampling experiment to simulate a contingency table under fixed margins. (Hint: Show that the vector to be simulated is of dimension $(I - 1)(J - 1)$.) (Note: Alternative methods are described by Aldous 1987 and Diaconis and Sturmfels 1998.)
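
Whatever sampler is designed, each simulated table has to be turned into a value of the statistic. A minimal helper for that step might look as follows; the function name and the example table are illustrative, not from the text, and the code assumes all margins are positive.

```python
def chi_squared_statistic(table):
    """Chi-squared statistic for independence of a table given as a list of
    rows; expected counts are N_i. * N_.j / N (all margins assumed > 0)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (count - expected) ** 2 / expected
    return stat


example = chi_squared_statistic([[10, 20], [30, 40]])
```

For the table above the expected counts are 12, 18, 28, and 42, so the statistic equals 50/63.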

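10.28 The one-way random effects model is usually written

$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad \alpha_i \sim \mathcal{N}(0, \sigma_\alpha^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2), \quad j = 1, \ldots, n_i, \quad i = 1, \ldots$

(a) Show it can also be written as the hierarchical model

$Y_{ij}|\alpha_i \sim \mathcal{N}(\mu + \alpha_i, \sigma^2), \qquad \alpha_i \sim \mathcal{N}(0, \sigma_\alpha^2).$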



(b) Show the joint density is 



i ^ ^ ij 



{Note: This is a complete data likelihood if we consider the ai’s to be missing 
data.) 

(c) Show that the full conditionals are given by 






mcrl , 

2 I 2 M)? ■ 2 I 2 

Uiai -h -h 



y\y,a,crl,(T'^ ~ ^{y~°‘^'^^ > 



cr^ 

2 | 2 
cr |y,a,/i,o-c, 




where $N = \sum_i n_i$ and $\mathcal{IG}(a, b)$ is the inverted gamma distribution (see Appendix A).
(d) The following data come from a large experiment to assess the precision of estimation of chemical content of turnip greens, where the leaves represent a random effect (Snedecor and Cochran 1971). Run a Gibbs sampler to estimate the variance components, plot the histograms and monitor the convergence of the chain.



Leaf    % Ca
1       3.28   3.09   3.03   3.03
2       3.52   3.48   3.38   3.38
3       2.88   2.80   2.81   2.76
4       3.34   3.38   3.24   3.26

Table 10.4. Calcium concentration (%) in turnip greens (Source: Snedecor and Cochran 1971).
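
As a starting point for part (d), the following is a minimal Gibbs sampler sketch for the data of Table 10.4. It rests on assumptions that are not from the text: a flat prior on $\mu$ and proper inverted gamma priors $\mathcal{IG}(a_0, b_0)$ on both variances (the values $a_0 = 2$, $b_0 = 0.1$ are arbitrary), so the conditionals below are the standard conjugate updates under those priors and need not coincide with those of part (c).

```python
import math
import random

TURNIP = [[3.28, 3.09, 3.03, 3.03],
          [3.52, 3.48, 3.38, 3.38],
          [2.88, 2.80, 2.81, 2.76],
          [3.34, 3.38, 3.24, 3.26]]


def inv_gamma(shape, rate, rng):
    # If G ~ Gamma(shape, scale=1/rate), then 1/G ~ IG(shape, rate).
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)


def gibbs_random_effects(data, n_iter=5000, a0=2.0, b0=0.1, seed=0):
    """Gibbs sampler for Y_ij = mu + alpha_i + eps_ij with a flat prior on mu
    and IG(a0, b0) priors on sigma_alpha^2 and sigma^2 (assumed priors)."""
    rng = random.Random(seed)
    k = len(data)
    n_total = sum(len(group) for group in data)
    mu = sum(sum(group) for group in data) / n_total
    alpha = [0.0] * k
    sig2_a, sig2 = 0.1, 0.01
    draws = []
    for _ in range(n_iter):
        for i, group in enumerate(data):
            n_i = len(group)
            ybar_i = sum(group) / n_i
            denom = n_i * sig2_a + sig2
            alpha[i] = rng.gauss(n_i * sig2_a / denom * (ybar_i - mu),
                                 math.sqrt(sig2 * sig2_a / denom))
        centered = sum(y - alpha[i] for i, g in enumerate(data) for y in g)
        mu = rng.gauss(centered / n_total, math.sqrt(sig2 / n_total))
        sig2_a = inv_gamma(a0 + k / 2.0,
                           b0 + 0.5 * sum(a * a for a in alpha), rng)
        sse = sum((y - mu - alpha[i]) ** 2
                  for i, g in enumerate(data) for y in g)
        sig2 = inv_gamma(a0 + n_total / 2.0, b0 + 0.5 * sse, rng)
        draws.append((mu, sig2_a, sig2))
    return draws


draws = gibbs_random_effects(TURNIP)
```

Histograms of the second and third components of `draws` give the variance components; because $\mu$ and the $\alpha_i$ are strongly correlated in this parameterization, the chain for $\mu$ mixes slowly and convergence should be monitored as the problem asks.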



10.29 †(Gelfand et al. 1990) For a population of 30 rats, the weight $y_{ij}$ of rat $i$ is observed at age $x_j$ and is associated with the model

$$Y_{ij} \sim \mathcal{N}(\alpha_i + \beta_i(x_j - \bar{x}), \sigma_c^2).$$

(a) Give the Gibbs sampler associated with this model and the prior

$$\alpha_i \sim \mathcal{N}(\alpha_c, \sigma_\alpha^2), \qquad \beta_i \sim \mathcal{N}(\beta_c, \sigma_\beta^2),$$

with almost flat hyperpriors

$$\alpha_c, \beta_c \sim \mathcal{N}(0, 10^4), \qquad \sigma_c^{-2}, \sigma_\alpha^{-2}, \sigma_\beta^{-2} \sim \mathcal{G}a(10^{-3}, 10^{-3}).$$



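(b) Assume now that

$$Y_{ij} \sim \mathcal{N}(\beta_{1i} + \beta_{2i}x_j, \sigma_c^2),$$

with $\beta_i = (\beta_{1i}, \beta_{2i}) \sim \mathcal{N}_2(\mu_\beta, \Omega_\beta)$. Using a Wishart hyperprior $\mathcal{W}(2, R)$ on $\Omega_\beta$, with

$$R = \begin{pmatrix} 200 & 0 \\ 0 & 0.2 \end{pmatrix},$$

give the corresponding Gibbs sampler.

(c) Study whether the original assumption of independence between $\beta_{1i}$ and $\beta_{2i}$ holds.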

10.30 †(Spiegelhalter et al. 1995a,b) Binomial observations

$$R_i \sim \mathcal{B}(n_i, p_i), \qquad i = 1, \ldots, 12,$$

correspond to mortality rates for cardiac surgery on babies in hospitals. When the failure probability $p_i$ for hospital $i$ is modeled by a random effect structure

$$\mathrm{logit}(p_i) \sim \mathcal{N}(\mu, \tau^2),$$

† Problems with this dagger symbol are studied in detail in the BUGS manual of Spiegelhalter et al. (1995a,b). Corresponding datasets can also be obtained from this software (see Note 10.6.2).
with almost flat hyperpriors

$$\mu \sim \mathcal{N}(0, 10^6), \qquad \tau^{-2} \sim \mathcal{G}a(10^{-3}, 10^{-3}),$$

examine whether a Gibbs sampler can be implemented. (Hint: Consider the possible use of ARS as in Section 2.4.2.)

10.31 †(Bock and Aitkin 1981) Data from the Law School Aptitude Test (LSAT) corresponds to multiple-choice test answers $(y_{j1}, \ldots, y_{j5})$ in $\{0, 1\}^5$. The $y_{jk}$'s are modeled as $\mathcal{B}(p_{jk})$, with $(j = 1, \ldots, 1000, k = 1, \ldots, 5)$

$$\mathrm{logit}(p_{jk}) = \theta_j - \alpha_k, \qquad \theta_j \sim \mathcal{N}(0, \sigma^2).$$

(This is the Rasch model.) Using vague priors on the $\alpha_k$'s and $\sigma^2$, give the marginal distribution of the probability $P_i$ to answer $i \in \{0, 1\}^5$. (Hint: Show that $P_i$ is the posterior expectation of the probability $P_{i|\theta}$ conditional on ability level.)

10.32 †(Dellaportas and Smith 1993) Observations $t_i$ on survival time are related to covariates $z_i$ by a Weibull model

$$T_i|z_i \sim \mathcal{W}e(r, \mu_i), \qquad \mu_i = \exp(\beta^t z_i),$$

with possible censoring. The prior distributions are

r ~ ea(l, 10““). 

(a) Construct the associated Gibbs sampler and derive the posterior expecta- 
tion of the median, log(2 exp(— 

(b) Compare with an alternative implementation using a slice sampler [A.32].

10.33 †(Spiegelhalter et al. 1995a,b) In the study of the effect of a drug on a heart disease, the number of contractions per minute for patient $i$ is recorded before treatment ($x_i$) and after treatment ($y_i$). The full model is $X_i \sim \mathcal{P}(\lambda_i)$ and for uncured patients, $Y_i \sim \mathcal{P}(\beta\lambda_i)$, whereas for cured patients, $y_i = 0$.

(a) Show that the conditional distribution of $Y_i$ given $t_i = x_i + y_i$ is $\mathcal{B}(t_i, \beta/(1 + \beta))$ for the uncured patients.

(b) Express the distribution of the $Y_i$'s as a mixture model and derive the Gibbs sampler.

10.34 †(Breslow and Clayton 1993) In the modeling of breast cancer cases $y_i$ according to age $x_i$ and year of birth $d_i$, an exchangeable solution is

$$Y_i \sim \mathcal{P}(\mu_i),$$

$$\log(\mu_i) = \log(d_i) + \alpha_{x_i} + \beta_{d_i},$$

Pk ~ 

(a) Derive the Gibbs sampler associated with almost flat hyperpriors on the parameters $\alpha_j$, $\beta_k$, and $\sigma$.

(b) Breslow and Clayton (1993) consider a dependent alternative where for $k = 3, \ldots, 11$ we have

(10.16)    $\beta_k|\beta_1, \ldots, \beta_{k-1} \sim \mathcal{N}(2\beta_{k-1} - \beta_{k-2}, \sigma^2),$



while /3i,p2 ~ A/^(0, 10^(7^). Construct the associated Gibbs sampler and 
compare with the previous results. 
(c) An alternative representation of (10.16) is

Determine the value of Pk and compare the associated Gibbs sampler with 
the previous implementation. 

10.35 †(Dobson 1983) The effect of a pesticide is tested against its concentration $x_i$ on $n_i$ beetles, $R_i \sim \mathcal{B}(n_i, p_i)$ of which are killed. Three generalized linear models are in competition:

$$p_i = \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)},$$

$$p_i = \Phi(\alpha + \beta x_i),$$

$$p_i = 1 - \exp(-\exp(\alpha + \beta x_i));$$

that is, the logit, probit, and log-log models, respectively. For each of these models, construct a Gibbs sampler and compute the expected posterior deviance; that is, the posterior expectation of

$$D = 2\sum_i (\hat\ell_i - \ell_i),$$

where

$$\ell_i = r_i\log(p_i) + (n_i - r_i)\log(1 - p_i), \qquad \hat\ell_i = \max_{p_i} \ell_i.$$
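
Inside any sampler for this problem, the deviance has to be evaluated at each simulated $(\alpha, \beta)$. The sketch below shows one way to code the three links and the deviance; it assumes the probit link is $\Phi(\alpha + \beta x)$ and uses the fact that $\hat\ell_i$ is attained at $p_i = r_i/n_i$. The function names and the small dataset are hypothetical and are not the beetle data of Dobson (1983).

```python
import math


def logit_link(a, b, x):
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def probit_link(a, b, x):
    # Standard normal cdf evaluated at a + b * x.
    return 0.5 * (1.0 + math.erf((a + b * x) / math.sqrt(2.0)))


def loglog_link(a, b, x):
    return 1.0 - math.exp(-math.exp(a + b * x))


def binom_loglik(r, n, p):
    # r log p + (n - r) log(1 - p), with the convention 0 * log(0) = 0.
    out = 0.0
    if r > 0:
        out += r * math.log(p)
    if n - r > 0:
        out += (n - r) * math.log(1.0 - p)
    return out


def deviance(rs, ns, ps):
    """D = 2 * sum_i (lhat_i - l_i), where lhat_i is l_i at p_i = r_i / n_i."""
    return 2.0 * sum(binom_loglik(r, n, r / n) - binom_loglik(r, n, p)
                     for r, n, p in zip(rs, ns, ps))


# Hypothetical doses and counts, for illustration only.
xs, ns, rs = [1.0, 2.0, 3.0], [20, 20, 20], [3, 10, 17]
d_logit = deviance(rs, ns, [logit_link(-4.0, 2.0, x) for x in xs])
```

Averaging `deviance` over the simulated parameter values gives the expected posterior deviance for each link.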

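10.36 †(Spiegelhalter et al. 1995a,b) Consider a standard Bayesian ANOVA model $(i = 1, \ldots, 4, j = 1, \ldots, 5)$

$$Y_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2),$$

$$\mu_{ij} = \alpha_i + \beta_j,$$

$$\alpha_i \sim \mathcal{N}(0, \sigma_\alpha^2),$$

$$\beta_j \sim \mathcal{N}(0, \sigma_\beta^2),$$

$$\sigma^{-2} \sim \mathcal{G}a(a, b),$$

with $\sigma_\alpha^2 = \sigma_\beta^2 = 5$ and $a = 0$, $b = 1$. Gelfand et al. (1992) impose the constraints

$$\alpha_1 > \cdots > \alpha_4 \quad \text{and} \quad \beta_1 < \cdots < \beta_3 > \cdots > \beta_5.$$

(a) Give the Gibbs sampler for this model. (Hint: Use the optimal truncated normal Accept-Reject algorithm of Example 2.20.)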

(b) Change the parameterization of the model as (i = 2, . . . , 4, j = 1, 2, 4, 5) 

OLi — OLi—\ -f- Ci, €i ^ 0, 

Pj = ft - r]j, T]j > 0, 

and modify the prior distribution in 

ai ^ A/'(0, al), (33 ^ A/'(0, al), a, pj ^ ^a(0, 1). 

Check whether the posterior distribution is well defined and compare the 
performances of the corresponding Gibbs sampler with the previous imple- 
mentation. 

10.37 In the setup of Note 10.6.3: 
(a) Evaluate the time requirements for the computation of the exact formulas for the $\mu_i - \mu_{i+1}$.

(b) Devise an experiment to test the maximal value of $x_{(n)}$ which can be processed on your computer.

(c) In cases when the exact value can be computed, study the convergence 
properties of the corresponding Gibbs sampler. 

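10.38 In the setup of Note 10.6.3, we define the canonical moments of a distribution and show that they can be used as a representation of this distribution.

(a) Show that the two first moments $\mu_1$ and $\mu_2$ are related by the following two inequalities:

$$\mu_1^2 \le \mu_2 \le \mu_1,$$

and that the sequence $(\mu_k)$ is monotonically decreasing to 0.

(b) Consider a $k$th-degree polynomial

$$P_k(x) = \sum_{i=0}^k a_i x^i.$$

Deduce from

$$\int_0^1 P_k^2(x)\,dG(x) \ge 0$$

that

(10.17)    $a^t C_k a \ge 0, \qquad \forall a \in \mathbb{R}^{k+1},$

where

$$C_k = \begin{pmatrix} 1 & \mu_1 & \mu_2 & \cdots & \mu_k \\ \mu_1 & \mu_2 & \mu_3 & \cdots & \mu_{k+1} \\ \vdots & & & & \vdots \\ \mu_k & \mu_{k+1} & & \cdots & \mu_{2k} \end{pmatrix}$$

and $a^t = (a_0, a_1, \ldots, a_k)$.

(c) Show that for every distribution $G$, the moments $\mu_k$ satisfy

$$\begin{vmatrix} 1 & \mu_1 & \mu_2 & \cdots & \mu_k \\ \mu_1 & \mu_2 & \mu_3 & \cdots & \mu_{k+1} \\ \vdots & & & & \vdots \\ \mu_k & \mu_{k+1} & & \cdots & \mu_{2k} \end{vmatrix} \ge 0.$$

(Hint: Interpret this as a property of $C_k$.)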

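(d) Using inequalities similar to (10.17) for the polynomials $t(1-t)P_k^2(t)$, $tP_k^2(t)$, and $(1-t)P_k^2(t)$, derive the following inequalities on the moments of $G$:

$$\begin{vmatrix} \mu_1 - \mu_2 & \mu_2 - \mu_3 & \cdots & \mu_{k-1} - \mu_k \\ \mu_2 - \mu_3 & \mu_3 - \mu_4 & \cdots & \mu_k - \mu_{k+1} \\ \vdots & & & \vdots \\ \mu_{k-1} - \mu_k & & \cdots & \mu_{2k-1} - \mu_{2k} \end{vmatrix} \ge 0,$$

$$\begin{vmatrix} \mu_1 & \mu_2 & \cdots & \mu_k \\ \mu_2 & \mu_3 & \cdots & \mu_{k+1} \\ \vdots & & & \vdots \\ \mu_k & \mu_{k+1} & \cdots & \mu_{2k-1} \end{vmatrix} \ge 0,$$

$$\begin{vmatrix} 1 - \mu_1 & \mu_1 - \mu_2 & \cdots & \mu_{k-1} - \mu_k \\ \mu_1 - \mu_2 & \mu_2 - \mu_3 & \cdots & \mu_k - \mu_{k+1} \\ \vdots & & & \vdots \\ \mu_{k-1} - \mu_k & & \cdots & \mu_{2k-2} - \mu_{2k-1} \end{vmatrix} \ge 0.$$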



(e) Show that the bounds in parts (c) and (d) induce a lower (upper) bound $\underline{c}_{2k}$ ($\bar{c}_{2k}$) on $\mu_{2k}$ and that part (c) [(d)] induces a lower (upper) bound $\underline{c}_{2k-1}$ ($\bar{c}_{2k-1}$) on $\mu_{2k-1}$.

(f) Defining $p_k$ as

$$p_k = \frac{\mu_k - \underline{c}_k}{\bar{c}_k - \underline{c}_k},$$

show that the relation between $(p_1, \ldots, p_n)$ and $(\mu_1, \ldots, \mu_n)$ is one-to-one for every $n$ and that the $p_i$ are independent.

(g) Show that the inverse transform is given by the following recursive formulas. Define

$$q_i = 1 - p_i, \qquad \zeta_1 = p_1, \qquad \zeta_i = p_i q_{i-1} \quad (i \ge 2);$$

then

$$S_{1,k} = \zeta_1 + \cdots + \zeta_k \quad (k \ge 1),$$

$$\mu_n = S_{n,n}.$$

(Note: See Dette and Studden 1997 for details and useful complements on canonical moments.)
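
The determinant conditions of parts (a) and (c) of Problem 10.38 are easy to check numerically for a given moment sequence. The sketch below does so in exact arithmetic for the uniform distribution on $[0, 1]$, whose moments are $\mu_i = 1/(i+1)$; the choice of distribution and the helper names are assumptions of this illustration.

```python
from fractions import Fraction


def hankel(moments, k):
    """Matrix C_k of (10.17); moments[i] holds mu_i, with mu_0 = 1."""
    return [[moments[i + j] for j in range(k + 1)] for i in range(k + 1)]


def det(matrix):
    # Determinant by Gaussian elimination in exact rational arithmetic.
    m = [row[:] for row in matrix]
    n = len(m)
    d = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if m[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for j in range(c, n):
                m[r][j] -= factor * m[c][j]
    return d


# Moments of the uniform distribution on [0, 1]: mu_i = 1 / (i + 1).
uniform_moments = [Fraction(1, i + 1) for i in range(7)]
dets = [det(hankel(uniform_moments, k)) for k in range(1, 4)]
```

For the uniform distribution $C_k$ is a Hilbert matrix, so every determinant in `dets` is strictly positive; a sequence that violates the condition cannot be a moment sequence on $[0, 1]$.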

10.39 The modification of [A.38] corresponding to the reparameterization discussed in Example 10.26 only involves a change in the generation of the parameters. In the case $k = 2$, show that it is given by



Simulate 



pIm, 0,a,T ^ Be{ni + 1, ri 2 + 1); 



0\p, (j,T,p J\f 






(^2 - m) 



1 + 



ri2 



ri2 + 



nixi a ^ti2{x2 — t9) 
n\ -h ri 2 cr~‘^ 



7l2(j‘^ + ri2 



(10.18) 






2r2 



<t2 < 1 ? 



n 1 



1m,6>,o-,p ~ 0a ( ( Si -hui(xi - m) + 



x2 , «2 . n2{X2 - (J-f 



Tl2C^ + cr2 



10.40 (Gruet et al. 1999) As in Section 10.4.1, consider a reparameterization of a mixture of exponential distributions, $\sum_j p_j\,\mathcal{E}xp(\lambda_j)$.

(a) Use the identifiability constraint, $\lambda_1 > \lambda_2 > \cdots > \lambda_k$, to write the mixture with the "cascade" parameterization

$$p\,\mathcal{E}xp(\lambda) + \sum_{j=1}^{k-1} (1 - p) \cdots q_j\,\mathcal{E}xp(\lambda\sigma_1 \cdots \sigma_j),$$

with $q_{k-1} = 1$, $\sigma_1 < 1, \ldots, \sigma_{k-1} < 1$.

(b) For the prior distribution 7 t(A) — Qj ^ ^ ^o,i]? show that 

the corresponding posterior distribution is always proper. 

(c) For the prior of part (b), show that [A.34] leads to the following algorithm:

Algorithm A.47 -Exponential Mixtures-

2. Generate 

A ^ Qa (n, + criUiXi + ■ - ' + ai ■ ■ t 

<T\ ^ Qcl (fii + ’ ■ * Hh j A {riixi + cf2T123^2 + ■ * * + 

<72 ■ 

p ^ Be(no + 1, u. — 710 + 1)^ 

qk^2 ^ Be{nk-2 + li n.*-! + 1), 

where $n_0, n_1, \ldots, n_{k-1}$ denote the size of subsamples allocated to the components $\mathcal{E}xp(\lambda)$, $\mathcal{E}xp(\lambda\sigma_1)$, ..., $\mathcal{E}xp(\lambda\sigma_1 \cdots \sigma_{k-1})$ and $n_0\bar{x}_0, n_1\bar{x}_1, \ldots, n_{k-1}\bar{x}_{k-1}$ are the sums of the observations allocated to these components.

10.6 Notes 

10.6.1 A Bit of Background 

Although somewhat removed from statistical inference in the classical sense and based on earlier techniques used in statistical physics, the landmark paper by Geman and Geman (1984) brought Gibbs sampling into the arena of statistical application. This paper is also responsible for the name Gibbs sampling, because it implemented this method for the Bayesian study of Gibbs random fields which, in turn, derive their name from the physicist Josiah Willard Gibbs (1839-1903). This original implementation of the Gibbs sampler was applied to a discrete image processing problem and did not involve completion.

The work of Geman and Geman (1984), built on that of Metropolis et al. (1953), Hastings (1970), and his student Peskun (1973), influenced Gelfand and Smith (1990) to write a paper that sparked new interest in Bayesian methods, statistical computing, algorithms, and stochastic processes through the use of computing algorithms such as the Gibbs sampler and the Metropolis-Hastings algorithm. It is interesting to see, in retrospect, that earlier papers had proposed similar solutions but did not find the same response from the statistical community. Among these, one may quote Besag (1974, 1986), Besag and Clifford (1989), Broniatowski et al. (1984), Qian and Titterington (1990), and Tanner and Wong (1987).
10.6.2 The BUGS Software

The acronym BUGS stands for Bayesian inference using Gibbs sampling. This software was developed by Spiegelhalter et al. (1995a,b,c) at the MRC Biostatistics Unit in Cambridge, England. As shown by its name, it has been designed to take advantage of the possibilities of the Gibbs sampler in Bayesian analysis. BUGS includes a language which is C or R like and involves declarations about the model, the data, and the prior specifications, for single or multiple levels in the prior modeling. For instance, for the benchmark nuclear pump failures dataset of Example 10.17, the model and priors are defined by

    for (i in 1:N) {
        theta[i] ~ dgamma(alpha, beta)
        lambda[i] <- theta[i] * t[i]
        x[i] ~ dpois(lambda[i])
    }
    alpha ~ dexp(1.0)
    beta ~ dgamma(0.1, 1.0)

(see Spiegelhalter et al. 1995b, p.9). Most standard distributions are recognized by BUGS (21 are listed in Spiegelhalter et al. 1995a), which also allows for a large range of transforms. BUGS also recognizes a series of commands like compile, data, out, and stat. The output of BUGS is a table of the simulated values of the parameters after an open number of warmup iterations, the batch size being also open.

A major restriction of this software is the use of the conjugate priors or, at least, log-concave distributions for the Gibbs sampler to apply. However, more complex distributions can be handled by discretization of their support and assessment of the sensitivity to the discretization step. In addition, improper priors are not accepted and must be replaced by proper priors with small precision, like dnorm(0, 0.0001), which represents a normal modeling with mean 0 and precision (inverse variance) 0.0001.

The BUGS manual (Spiegelhalter et al. 1995a) is quite informative and well 
written. In addition, the authors have compiled a most helpful example manual 
(Spiegelhalter et al. 1995b, 1996), which exhibits the ability of BUGS to deal with an 
amazing number of models, including meta-analysis, latent variable, survival anal- 
ysis, nonparametric smoothing, model selection and geometric modeling, to name 
a few. (Some of these models are presented in Problems 10.29-10.36.) The BUGS 
software is also compatible with the convergence diagnosis software CODA presented 
in Note 12.6.2. 



At the present time, that is, Spring 2004, the BUGS software is available as freeware on the Web site http://www.mrc-bsu.cam.ac.uk/bugs for a wide variety of platforms.

10.6.3 Nonparametric Mixtures

Consider $X_1, \ldots, X_n$ distributed from a mixture of geometric distributions,

$$X_i \sim \int_0^1 \theta^x (1 - \theta)\,dG(\theta), \qquad X_i \in \mathbb{N},$$

where $G$ is an arbitrary distribution on $[0, 1]$. In this nonparametric setup, the likelihood can be expressed in terms of the moments

$$\mu_i = \int_0^1 \theta^i\,dG(\theta), \qquad i = 1, \ldots$$

since $G$ is then identified by the $\mu_i$'s. The likelihood can be written

$$\prod_{i=1}^n (\mu_{x_i} - \mu_{1+x_i}).$$

A direct Bayesian modeling of the $\mu_i$'s is impossible because of the constraints between the moments, such as $\mu_i \ge \mu_1^i$ $(i \ge 2)$, which create dependencies between the different moments (Problem 10.38). The canonical moments technique (see Olver 1974 and Dette and Studden 1997) can overcome this difficulty by expressing the $\mu_i$'s as transforms of a sequence $(p_j)$ on $[0, 1]$ (see Problem 10.38). Since the $p_j$'s are not constrained, they can be modeled as uniform on $[0, 1]$. The connection between the $\mu_i$'s and the $p_j$'s is given by recursion equations. Let $q_i = 1 - p_i$, $\zeta_1 = p_1$, and $\zeta_i = p_i q_{i+1}$ $(i \ge 2)$, and define

$$S_{1,k} = \zeta_1 + \cdots + \zeta_k \quad (k \ge 1),$$

$$S_{j,k} = \sum_{u=1}^{k-j+1} \zeta_u S_{j-1,u+j-1} \quad (j \ge 2).$$

It is then possible to show that $\mu_j - \mu_{j+1} = q_1 S_{j,j}$ (see Problem 10.38).
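
The recursion can be coded directly, which also gives a feel for its cost. The sketch below is an illustration under one explicit assumption: it takes $\zeta_i = p_i q_{i+1}$ for every $i \ge 1$ (including $i = 1$), which is the convention under which the relation $\mu_j - \mu_{j+1} = q_1 S_{j,j}$ reproduces the moments of the uniform distribution in the check that follows. The sequence $p = (1/2, 1/3, 1/2, 2/5, 1/2)$ is taken here to be the first canonical moments of the uniform distribution on $[0, 1]$, and the function name is hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


def moment_gaps(p, j_max):
    """Return [mu_j - mu_{j+1} for j = 1..j_max] from canonical moments.

    p[0], p[1], ... hold p_1, p_2, ...; the values p_1..p_{j_max+1} are used.
    """
    if len(p) < j_max + 1:
        raise ValueError("need p_1, ..., p_{j_max + 1}")
    q = [1 - value for value in p]

    def zeta(i):
        # Assumption of this sketch: zeta_i = p_i * q_{i+1} for all i >= 1.
        return p[i - 1] * q[i]

    @lru_cache(maxsize=None)
    def S(j, k):
        if j == 1:
            return sum(zeta(u) for u in range(1, k + 1))
        return sum(zeta(u) * S(j - 1, u + j - 1)
                   for u in range(1, k - j + 2))

    return [q[0] * S(j, j) for j in range(1, j_max + 1)]


p_uniform = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 2),
             Fraction(2, 5), Fraction(1, 2)]
gaps = moment_gaps(p_uniform, 4)
```

For the uniform distribution $\mu_j - \mu_{j+1} = 1/((j+1)(j+2))$, so `gaps` should equal 1/6, 1/12, 1/20, 1/30. Memoizing $S_{j,k}$ keeps the numerical evaluation cheap; the symbolic expansion of the same sums is what becomes unmanageable.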

From a computational point of view, the definition of the $\mu_i$'s via recursion equations complicates the exact derivation of Bayes estimates, and they become too costly when $\max x_i > 5$. This setting where numerical complexity prevents the analytical derivation of Bayes estimators can be solved via Gibbs sampling.

The complexity of the relations $S_{j,k}$ is due to the action of the sums in the recursion equations; for instance,

$$\mu_3 - \mu_4 = q_1 p_1 q_2\{p_1 q_2(p_1 q_2 + p_2 q_3) + p_2 q_3(p_1 q_2 + p_2 q_3 + p_3 q_4)\}.$$

This complexity can be drastically reduced if, through a demarginalization (or completion) device, every sum $S_{j,k}$ in the recursion equation is replaced by one of its terms $\zeta_u S_{j-1,u+j-1}$ $(1 \le u \le k - j + 1)$. In fact, $\mu_k - \mu_{k+1}$ is then a product of $p_i$'s and $q_j$'s, which leads to a beta distribution on the parameters $p_i$. To achieve such simplification, the expression

$$P(X_i = k) = \mu_k - \mu_{k+1} = \zeta_1 S_{k-1,k} = \zeta_1\{\zeta_1 S_{k-2,k-1} + \zeta_2 S_{k-2,k}\}$$

can be interpreted as a marginal distribution of the $X_i$ by introducing $Z_1^i \in \{0, 1\}$ such that

$$P(X_i = k, Z_1^i = 0) = \zeta_1^2 S_{k-2,k-1},$$

$$P(X_i = k, Z_1^i = 1) = \zeta_1\zeta_2 S_{k-2,k}.$$

Then, in a similar manner, introduce $Z_2^i \in \{0, 1, 2\}$ such that the density of $(X_i, Z_1^i, Z_2^i)$ (with respect to counting measure) is

$$f(x_i, z_1^i, z_2^i) = \zeta_1\,\zeta_{z_1^i+1}\,\zeta_{z_2^i+1}\,S_{x_i-3,\,x_i-2+z_2^i}.$$

The replacement of the Sk,j^s thus requires the introduction of {k — 1) variables Z] 
for each observation Xi — k. Once the model is completed by the ZJ’s, the posterior 
distribution of pj , 

n 

n (9iPi92P^;+i9z|+2 ■ • • 94^_i+2). 

i=l 

is a product of beta distributions on the pj’s, which are easily simulated. 

Similarly, the distribution of Zl conditionally on pj and the other dummy vari- 
ables Zr (r / s) is given by 

=W, = w) (X Pwqw+\lzi=^-l H V Pv+2qv+zlzi=v+l- 

The Gibbs sampler thus involves a large number of (additional) steps in this case, 
namely 1 -h — 1) simulations, since it imposes the “local” generation of the 

z\'s. In fact, an arbitrary grouping of z\ would make the simulation much more 
difficult, except for the case of a division of - • ■ , into two subvectors 

corresponding to the odd and even indices, respectively. 

Suppose that the parameter of interest is $p = (p_1, \ldots, p_{K+1})$, where $K$ is the largest observation. (The distribution of $p_i$ for indices larger than $K + 1$ is unchanged by the observations. See Robert 2001.) Although $p$ is generated conditionally on the complete data $(x_i, z^i)$ $(i = 1, \ldots, n)$, this form of Gibbs sampling is not a Data Augmentation scheme since the $z^i$'s are not simulated conditionally on $\theta$, but rather one component at a time, from the distribution

f{zl\p,zl~l =w, 2:1+1 =w) (X p^qn,+ llzi=zo-l Pv+2qv+3lzl=v+l- 

However, this complexity does not prevent the application of Theorem 9.13 since the sequence of interest is generated conditionally on the $z_s^i$'s. Geometric convergence thus applies.

10.6.4 Graphical Models 

Graphical models use graphs to analyze statistical models. They have been developed mainly to represent conditional independence relations, primarily in the field of expert systems (Whittaker 1990, Spiegelhalter et al. 1993). The Bayesian approach to these models, as a way to incorporate model uncertainty, has been aided by the advent of MCMC techniques, as stressed by Madigan and York (1995), an expository paper on which this note is based.

The construction of a graphical model is based on a collection of independence assumptions represented by a graph. We briefly recall here the essentials of graph theory and refer to Lauritzen (1996) for details. A graph is defined by a set of vertices or nodes, $\alpha \in \mathcal{V}$, which represents the random variables or factors under study, and by a set of edges, $(\alpha, \beta) \in \mathcal{V}^2$, which can be ordered (the graph is then said to be directed) or not (the graph is undirected). For a directed graph, $\alpha$ is a parent of $\beta$ if $(\alpha, \beta)$ is an edge (and $\beta$ is then a child of $\alpha$). (Directed graphs can be turned into undirected graphs by adding edges between nodes which share a child and dropping the directions.) Graphs are also often assumed to be acyclic; that is, without directed paths linking a node $\alpha$ with itself. This leads to the notion of directed acyclic graphs, introduced by Kiiveri and Speed (1982), often represented by the acronym DAG.

For the construction of probabilistic models on graphs, an important concept is that of a clique. A clique $C$ is a maximal subset of nodes which are all joined by an edge (in the sense that there is no subset containing $C$ and satisfying this condition). An ordering of the cliques of an undirected graph $(C_1, \ldots, C_n)$ is perfect if the nodes of each clique $C_i$ contained in a previous clique are all members of one previous clique (these nodes are called the separators, $\alpha \in S_i$). In this case, the joint distribution of the random variable $V$ taking values in $\mathcal{V}$ is

$$p(V) = \prod_{v \in \mathcal{V}} P(v \mid \mathcal{P}(v)),$$

where $\mathcal{P}(v)$ denotes the parents of $v$. This can also be written as

(10.19)    $p(V) = \dfrac{\prod_{i=1}^n p(C_i)}{\prod_{i=1}^n p(S_i)},$

and the model is then called decomposable; see Spiegelhalter and Lauritzen (1990), Dawid and Lauritzen (1993) or Lauritzen (1996). As stressed by Spiegelhalter et al. (1993), the representation (10.19) leads to a principle of local computation, which enables the building of a prior distribution, or the simulation from a conditional distribution on a single clique. (In other words, the distribution is Markov with respect to the undirected graph, as shown by Dawid and Lauritzen 1993.) The appeal of this property for a Gibbs implementation is then obvious.
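
A small numerical check may help fix ideas about the clique and separator factorization. The sketch below uses the simplest decomposable graph, a chain A - B - C with cliques {A, B} and {B, C} and separator {B}; the probability tables and function names are illustrative assumptions, not from the text.

```python
from fractions import Fraction
from itertools import product


def chain_joint(p_a, p_b_given_a, p_c_given_b):
    """Joint distribution of binary (A, B, C) where C depends on A only
    through B, so the graph A - B - C is decomposable."""
    return {(a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
            for a, b, c in product((0, 1), repeat=3)}


def marginal(joint, keep):
    # Sum the joint over all coordinates not listed in `keep`.
    out = {}
    for state, prob in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0) + prob
    return out


def clique_factorization(joint):
    """p(C_1) p(C_2) / p(S_2) for cliques {A,B}, {B,C} and separator {B}."""
    p_ab = marginal(joint, (0, 1))
    p_bc = marginal(joint, (1, 2))
    p_b = marginal(joint, (1,))
    return {(a, b, c): p_ab[(a, b)] * p_bc[(b, c)] / p_b[(b,)]
            for (a, b, c) in joint}


joint = chain_joint(
    [Fraction(1, 3), Fraction(2, 3)],
    [[Fraction(1, 4), Fraction(3, 4)], [Fraction(1, 2), Fraction(1, 2)]],
    [[Fraction(1, 5), Fraction(4, 5)], [Fraction(2, 3), Fraction(1, 3)]],
)
rebuilt = clique_factorization(joint)
```

Here `rebuilt` coincides exactly with `joint`. For a joint distribution in which C depends directly on A as well, the two would differ, which is what makes the factorization a statement about the graph.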

When the densities or probabilities are parameterized, the parameters are denoted by $\theta_A$ for the marginal distribution of $V \in A$, $A \subset \mathcal{V}$. (In the case of discrete models, $\theta = \theta_V$ may coincide with $p$ itself; see Example 10.33.) The prior distribution $\pi(\theta)$ must then be compatible with the graph structure: Dawid and Lauritzen (1993) show that a solution is of the form

(10.20)    $\pi(\theta) = \dfrac{\prod_{i=1}^n \pi_i(\theta_{C_i})}{\prod_{i=1}^n \tilde\pi_i(\theta_{S_i})},$

thus reproducing the clique decomposition (10.19).

Example 10.33. Discrete event graph. Consider a decomposable graph such that the random variables corresponding to all the nodes of $\mathcal{V}$ are discrete. Let $w \in \mathcal{W}$ be a possible value for the vector of these random variables and $\theta(w)$ be the associated probability. For the perfect clique decomposition $(C_1, \ldots, C_n)$, $\theta(w_i)$ denotes the marginal probability that the subvector $(v, v \in C_i)$ takes the value $w_i$ $(\in \mathcal{W}_i)$ and, similarly, $\theta(w_i^s)$ is the probability that the subvector $(v, v \in S_i)$ takes the value $w_i^s$ when $(S_1, \ldots, S_n)$ is the associated sequence of separators. In this case,

$$\theta(w) = \frac{\prod_{i=1}^n \theta(w_i)}{\prod_{i=1}^n \theta(w_i^s)}.$$

As illustrated by Madigan and York (1995), a Dirichlet prior can be constructed on $\theta_{\mathcal{W}} = (\theta(w), w \in \mathcal{W})$, which leads to genuine Dirichlet priors on the $\theta_{\mathcal{W}_i} = (\theta(w_i), w_i \in \mathcal{W}_i)$, under the constraint that the Dirichlet weights are identical over the intersection of two cliques. Dawid and Lauritzen (1993) demonstrate that this prior is unique, given the marginal priors on the cliques. ||



Example 10.34. Graphical Gaussian model. Giudici and Green (1999) provide another illustration of prior specification in the case of a graphical Gaussian model, $X \sim \mathcal{N}_p(0, \Sigma)$, where the precision matrix $K = \{k_{ij}\} = \Sigma^{-1}$ must comply with the conditional independence relations on the graph. For instance, if $X_v$ and $X_w$ are independent given the rest of the graph, then $k_{vw} = 0$. The likelihood can then be factored as

$$f(x|\Sigma) = \frac{\prod_{i=1}^n f(x_{C_i}|\Sigma^{C_i})}{\prod_{i=1}^n f(x_{S_i}|\Sigma^{S_i})},$$

with the same clique and separator notations as above, where $f(x_C|\Sigma^C)$ is the normal $\mathcal{N}_{p_C}(0, \Sigma^C)$ density, following the decomposition (10.19). The prior on $\Sigma$ can be chosen as the conjugate inverse Wishart priors on the $\Sigma^{C_i}$'s, under some compatibility conditions. ||



Madigan and York (1995) discuss an MCMC approach to model choice and 
model averaging in this setup, whereas Dellaportas and Forster (1996) and Giudici 
and Green (1999) implement reversible jump algorithms for determining the prob- 
able graph structures associated with a given dataset, the latter under a Gaussian 
assumption. 




11 



Variable Dimension Models and Reversible 
Jump Algorithms 



“We’re wasting our time,” he said. “The one thing we need is the one thing 
we’ll never get.” 

— Ian Rankin, Resurrection Men 



While the previous chapters have presented a general class of MCMC algorithms, there exist settings where they are not general enough. A particular case of such settings is that of variable dimension models. There, the parameter (and simulation) space is not well defined, being a finite or denumerable collection of unrelated subspaces. To have an MCMC algorithm moving within this collection of spaces requires more advanced tools, if only because of the associated measure theoretic subtleties. Section 11.1 motivates the use of variable dimension models in the setup of Bayesian model choice and model comparison, while Section 11.2 presents the general theory of reversible jump algorithms, which were tailored for these models. Section 11.3 examines further algorithms and methods related to this issue.



11.1 Variable Dimension Models 

In general, a variable dimension model is, to quote Peter Green, a "model where one of the things you do not know is the number of things you do not know". This means that the statistical model under consideration is not defined precisely enough for the dimension of the parameter space to be fixed. As detailed below in Section 11.1.1, this setting is closely associated with model selection, a collection of statistical procedures that are used at the early stage of a statistical analysis, namely, when the model to be used is not yet fully determined.

In addition to model construction, there are other situations where several models are simultaneously considered. For instance, this occurs in model checking, model comparison, model improvement, and model pruning, with many areas of application: variable selection in generalized linear models, change point determination in signal processing, object recognition in image analysis, coding identification in DNA analysis, dependence reconstruction in expert systems, and so on.

Fig. 11.1. Velocity (km/second) of galaxies in the Corona Borealis Region.

Example 11.1. Mixture modeling. Consider the dataset represented by the histogram of Figure 11.1, and also provided in Table 11.1 (see Problem 11.7). It consists of the velocities of 82 galaxies previously analyzed by Roeder (1992) and is often used as a benchmark example in mixture analysis (see, e.g., Chib 1995, Phillips and Smith 1996, Raftery 1996, Richardson and Green 1997, or Robert and Mengersen 1999).

A probabilistic model considered for the representation of this dataset is a Gaussian mixture model,

(11.1)    $\mathcal{M}_k : \quad x_i \sim \sum_{j=1}^k p_{jk}\,\mathcal{N}(\mu_{jk}, \sigma_{jk}^2) \qquad (i = 1, \ldots, 82),$

but the index $k$, that is, the number of components in the mixture (or of clusters of galaxies in the sample), is under debate. It cannot therefore be arbitrarily fixed to, say, $k = 7$ for the statistical analysis of this dataset. ||



11.1.1 Bayesian Model Choice 

While the concept of a variable dimension model is loosely defined, we can 
give a more formal definition which mostly pertains to the important case of 
model selection. 

Definition 11.2. A Bayesian variable dimension model is defined as a collection
of models ($k = 1, \ldots, K$),

$$\mathcal{M}_k = \{f(\cdot|\theta_k);\ \theta_k \in \Theta_k\},$$

associated with a collection of priors on the parameters of these models,

$$\pi_k(\theta_k),$$

and a prior distribution on the indices of these models,

$$\{\varrho(k),\ k = 1,\ldots,K\}.$$

Note that we will also use the more concise notation

$$\pi(k,\theta_k) = \varrho(k)\,\pi_k(\theta_k). \tag{11.2}$$

This is a density, in the sense that $\varrho(k)$ is a density with respect to the counting
measure on $\mathbb{N}$, while $\pi_k(\theta_k)$ is typically a density with respect to Lebesgue
measure on $\Theta_k$. The function (11.2) is then a density with respect to Lebesgue
measure on the union of spaces,

$$\Theta = \bigcup_k \{k\}\times\Theta_k.$$

From a Bayesian perspective, this representation of the problem implies
that inference is formally complete! Indeed, once the prior and the model
are defined, the model selected from the dataset $x$ is determined from the
posterior probabilities

$$p(\mathcal{M}_i|x) = \frac{\varrho(i)\int_{\Theta_i} f_i(x|\theta_i)\,\pi_i(\theta_i)\,d\theta_i}{\sum_j \varrho(j)\int_{\Theta_j} f_j(x|\theta_j)\,\pi_j(\theta_j)\,d\theta_j}$$

by either taking the model with largest $p(\mathcal{M}_i|x)$ or using model averaging
through

$$\sum_j p(\mathcal{M}_j|x)\int_{\Theta_j} f_j(x|\theta_j)\,\pi_j(\theta_j|x)\,d\theta_j,$$

where $x$ denotes the observed dataset, as a predictive distribution, even
though more sophisticated decision theoretic perspectives could be adopted
(see Robert, 2001, Chapter 7).

11.1.2 Difficulties in Model Choice

There are several kinds of difficulties with this formal resolution of the model
choice issue. While the definition of a prior distribution on the parameter
space

$$\Theta = \bigcup_k \{k\}\times\Theta_k$$

does not create any problem, thanks to the decomposition of $\pi(\theta)$ into
$\varrho(k)\pi_k(\theta_k)$, a first difficulty is that at the inferential level model choice is
a complex notion. The setting does not clearly belong to either the estimation
or the testing domain.
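
As a small illustration of the formal resolution of Section 11.1.1, the following sketch computes the posterior probabilities $p(\mathcal{M}_i|x)$ once the marginal likelihoods $\int f_i(x|\theta_i)\pi_i(\theta_i)\,d\theta_i$ are available. The function name and the numerical values are hypothetical and serve only as an illustration; in practice, obtaining the marginal likelihoods is precisely the difficult part.

```python
import math

def posterior_model_probabilities(log_marginals, prior_weights):
    """Return [p(M_i | x)] from log marginal likelihoods and prior weights rho(i)."""
    # log of rho(i) * m_i(x), where m_i(x) is the integral of f_i * pi_i over Theta_i
    log_terms = [math.log(rho) + lm for lm, rho in zip(log_marginals, prior_weights)]
    shift = max(log_terms)  # subtract the maximum before exponentiating (stability)
    weights = [math.exp(t - shift) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log marginal likelihoods for three candidate models (made-up values).
log_marginals = [-105.2, -101.7, -102.9]
prior_weights = [1 / 3, 1 / 3, 1 / 3]

probs = posterior_model_probabilities(log_marginals, prior_weights)
# Index of the model with the largest posterior probability.
map_model = max(range(len(probs)), key=lambda i: probs[i])
print(probs, map_model)
```

Working on the log scale avoids underflow when the marginal likelihoods are very small, which is the usual situation for datasets of even moderate size.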

As an example, when considering model choice as an estimation problem,
there is a tendency to overfit the model by selecting a model $\mathcal{M}_i$ with a
large number of parameters, and overfitting can only be countered with prior
distributions $\varrho(j)$ that depend on the sample size (see Robert, 2001, Chapter
7). Similarly, when adopting the perspective that model choice is a special case
of testing between models, the subsequent inference on the most “likely” model
fails to account for the selection process and its inherent error. This vagueness,
central to the model choice formulation, will translate into a requirement for
a many-faceted prior distribution.

We must stress here that our understanding of the Bayesian model choice
issue is that we must choose a completely new set of parameters for each model
$\mathcal{M}_k$ and set the parameter space as the union of the model parameter spaces
$\Theta_k$, even though some parameters may have a similar meaning in two different
models.

Example 11.3. Order of an AR(p) model. Recall that an AR(p) model
(Example 6.6) is given as the autoregressive representation of a time series,

$$\mathcal{M}_p:\quad X_t = \sum_{i=1}^{p}\theta_{ip}X_{t-i} + \sigma_p\varepsilon_t.$$

When comparing an AR(p) and an AR(p+1) model, it could be assumed that
the first $p$ autoregressive coefficients of $\mathcal{M}_{p+1}$ would be the same as the $\theta_{ip}$'s,
that is, that an AR(p) model is simply an AR(p+1) model with an extra zero
coefficient.

We note that it is important to consider the coefficients for each of the
models as an entire set of coefficients, and not individually. This is not only
because the models are different but, more importantly, because the best fitting
AR(p+1) model is not necessarily a modification of the best fitting AR(p)
model (obtained by adding an extra term, that is, a non-zero $\theta_{(p+1)(p+1)}$).
Moreover, from a Bayesian point of view, the parameters $\theta_{ip}$ are not
independent a posteriori. Similarly, even though the variance $\sigma^2$ has the same
formal meaning for all values of $p$, we insist on using a different variance
parameter for each value of $p$, hence the notation $\sigma_p^2$. ||

However, many statisticians prefer to use some parameters that are common
to all models, in order to reduce model and computational complexity
(Problems 4.1 and 4.2), and also to enforce Occam’s parsimony requirement
(see Note 11.5.1).¹ As we will see below, the reversible jump technique of
Section 11.2 is based upon this assumption of partially exchangeable parameters
between models, since it uses proposal distributions that modify only a part
of the parameter vector to move between models. (The centering technique of
Brooks et al. 2003b relies on the same assumption.)

¹ At another level, this alternative often allows for a resolution of some testing
difficulties associated with improper priors; see Berger and Pericchi (1998) and
Robert (2001, Chapter 6) for details.

More central to the theme of this book, there also are computational
difficulties related to variable dimension models. For instance, the number of
models in competition may well be infinite. Even when the number of models
is finite, there is additional complexity in representing, or simulating from, the
posterior distribution (11.2) in that a sampler must move both within and
between models $\Theta_k$. While the former (move) pertains to previous developments
in Chapters 7-10, the latter (move) requires a deeper measure-theoretic basis
to ensure the overall validity of the correct MCMC moves, that is, to preserve
$\pi(\theta|x)$ as the stationary distribution of the simulated Markov chain on $\Theta$.

Lastly, we mention that there is an enormous, and growing, literature on
the topic of model selection, from both a frequentist and a Bayesian point of
view. Starting with the seminal paper of Akaike (1974), which defined one of
the first model selection criteria (now known as AIC), a major focus of that
methodology is to examine the properties of the model selection procedure,
and to ensure, in some sense, that the correct model will be selected if the
number of observations is infinite. (This is known as consistency in model
selection.) However, many of our concerns are different from those in the
model selection literature, and we will not go into the details of the many model
selection criteria that are available. A good introduction to the many facets
of model selection is the collection edited by Lahiri (2001).



11.2 Reversible Jump Algorithms 

There have been several earlier approaches in the literature to deal with
variable dimension models using, for instance, birth-and-death processes (Ripley
1977, Geyer and Møller 1994) or pseudo-priors (Carlin and Chib 1995, see
Problem 11.9), but the general formalization of this problem has been
presented by Green (1995). Note that, at this stage, regular Gibbs sampling is
impossible when considering distributions of the form (11.2): if one conditions
on $k$, then $\theta_k \in \Theta_k$, and if one conditions on $\theta_k$, then $k$ cannot move. Therefore
a standard Gibbs sampler cannot provide moves between models $\Theta_k$ without
further modification of the setting.



11.2.1 Green’s Algorithm 

If we let $x = (k,\theta_k)$, the solution proposed by Green (1995) is based on a
reversible transition kernel $K$, that is, a kernel satisfying

$$\int_A\int_B K(x,dy)\,\pi(x)\,dx = \int_B\int_A K(y,dx)\,\pi(y)\,dy$$

for all $A, B \subset \Theta$ and for some invariant density $\pi$ (see Section 6.5.3). To see
more clearly² how this condition can be satisfied and how a proper reversible
kernel $K$ can be constructed, we decompose $K$ according to the model in
which it proposes a move: for the model $\mathcal{M}_m$, $q_m$ denotes a transition measure
on $\mathcal{M}_m$, and $\rho_m$ is the corresponding probability of accepting this move (or
jump). The decomposition of the kernel is thus

$$K(x,B) = \sum_m \int_B \rho_m(x,y')\,q_m(x,dy') + \omega(x)\,\mathbb{I}_B(x),$$

where

$$\omega(x) = 1 - \sum_m (\rho_m q_m)(x,\Theta_m)$$

represents the probability of no move.

Typically, and mostly for practicality’s sake, the jumps are limited to
moves from $\mathcal{M}_m$ to models with dimensions close to the dimension of $\Theta_m$,
possibly including $\mathcal{M}_m$: constructing a sensible proposal $q_m$ for a move from
$x = (m,\theta_m)$ to $y = (m',\theta_{m'})$ is generally too difficult when $\mathcal{M}_m$ and $\mathcal{M}_{m'}$
differ by many dimensions. The definition of $\rho_m$ (and the verification of the
reversibility assumption) relies on the following assumption: The joint
measure $\pi(dx)q_m(x,dy)$ must be absolutely continuous with respect to a symmetric
measure $\xi_m(dx,dy)$ on $\Theta\times\Theta$. If $g_m(x,y)$ denotes the density of $q_m(x,dy)\pi(dx)$
against this dominating measure $\xi_m(dx,dy)$ and if $\rho_m$ is written in the usual
Metropolis-Hastings form

$$\rho_m(x,y) = \min\left\{1,\ \frac{g_m(y,x)}{g_m(x,y)}\right\},$$

then reversibility is ensured by the symmetry of the measure $\xi_m$:

$$\begin{aligned}
\int_A\int_B \rho_m(x,y)\,q_m(x,dy)\,\pi(dx) &= \int_A\int_B \rho_m(x,y)\,g_m(x,y)\,\xi_m(dx,dy)\\
&= \int_A\int_B \rho_m(y,x)\,g_m(y,x)\,\xi_m(dy,dx)\\
&= \int_A\int_B \rho_m(y,x)\,q_m(y,dx)\,\pi(dy),
\end{aligned}$$

as $\rho_m(x,y)g_m(x,y) = \rho_m(y,x)g_m(y,x)$ by construction.

The main difficulty of this approach lies in the determination of the
measure $\xi_m$, given the symmetry constraint. If the jumps are decomposed into
moves between pairs of models, $\mathcal{M}_{k_1}$ and $\mathcal{M}_{k_2}$, the (clever!) idea of Green
(1995) is to supplement each of the spaces $\Theta_{k_1}$ and $\Theta_{k_2}$ with adequate
artificial spaces in order to create a bijection between them.

² The construction of the reversible jump kernel is slightly involved. For an alternate
description of this technique, which might be easier to understand, we refer the
reader to Section 11.2.2. There the reversible jump algorithm is justified using
ordinary Metropolis-Hastings arguments.

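For instance, if $\dim(\Theta_{k_1}) > \dim(\Theta_{k_2})$ and if the move from $\Theta_{k_1}$ to $\Theta_{k_2}$
can be represented by a deterministic transformation of $\theta^{(k_1)}$,

$$\theta^{(k_2)} = T(\theta^{(k_1)}),$$

Green (1995) imposes a dimension matching condition which is that the
opposite move from $\Theta_{k_2}$ to $\Theta_{k_1}$ is concentrated on the curve

$$\{\theta^{(k_1)} :\ \theta^{(k_2)} = T(\theta^{(k_1)})\}.$$

In the general case, if $\theta^{(k_1)}$ is completed by a simulation $u_1 \sim g_1(u_1)$
into $(\theta^{(k_1)},u_1)$ and $\theta^{(k_2)}$ by $u_2 \sim g_2(u_2)$ into $(\theta^{(k_2)},u_2)$ so that the mapping
between $(\theta^{(k_1)},u_1)$ and $(\theta^{(k_2)},u_2)$ is a bijection,

$$(\theta^{(k_2)},u_2) = T(\theta^{(k_1)},u_1), \tag{11.3}$$

the probability of acceptance for the move from $\mathcal{M}_{k_1}$ to $\mathcal{M}_{k_2}$ is then

$$\min\left(\frac{\pi(k_2,\theta^{(k_2)})}{\pi(k_1,\theta^{(k_1)})}\,\frac{\pi_{21}\,g_2(u_2)}{\pi_{12}\,g_1(u_1)}\left|\frac{\partial T(\theta^{(k_1)},u_1)}{\partial(\theta^{(k_1)},u_1)}\right|,\ 1\right), \tag{11.4}$$

involving the Jacobian of the transform (11.3), the probability $\pi_{ij}$ of choosing
a jump to $\mathcal{M}_{k_j}$ while in $\mathcal{M}_{k_i}$, and the density $g_i$ of $u_i$. This proposal satisfies
the detailed balance condition and the symmetry assumption of Green (1995)
if the move from $\mathcal{M}_{k_2}$ to $\mathcal{M}_{k_1}$ also satisfies (11.3) with $u_2 \sim g_2(u_2)$.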

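The pseudo-code representation of Green’s (1995) algorithm is thus as
follows.

Algorithm A.48: Green’s Algorithm

At iteration $t$, if $x^{(t)} = (m,\theta^{(m)})$,

1. Select model $\mathcal{M}_n$ with probability $\pi_{mn}$.

2. Generate $u_{mn} \sim \varphi_{mn}(u)$.

3. Set $(\theta^{(n)},v_{nm}) = T_{mn}(\theta^{(m)},u_{mn})$.

4. Take $x^{(t+1)} = (n,\theta^{(n)})$ with probability

$$\min\left(\frac{\pi(n,\theta^{(n)})}{\pi(m,\theta^{(m)})}\,\frac{\pi_{nm}\,\varphi_{nm}(v_{nm})}{\pi_{mn}\,\varphi_{mn}(u_{mn})}\left|\frac{\partial T_{mn}(\theta^{(m)},u_{mn})}{\partial(\theta^{(m)},u_{mn})}\right|,\ 1\right) \qquad [A.48]$$

and take $x^{(t+1)} = x^{(t)}$ otherwise.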



As pointed out by Green (1995), the density $\pi$ does not need to be
normalized, but the different component densities $\pi_k(\theta_k)$ must be known up to
the same constant.

Example 11.4. A linear Jacobian. To illustrate the procedure, Green
(1995) considers the toy example of switching between the parameters $(1,\theta)$
and $(2,\theta_1,\theta_2)$ using the following moves:

(i) To go from $(2,\theta_1,\theta_2)$ to $(1,\theta)$, set $\theta = (\theta_1+\theta_2)/2$.

(ii) To go from $(1,\theta)$ to $(2,\theta_1,\theta_2)$, generate a random variable $u \sim g(u)$ and
set $\theta_1 = \theta - u$ and $\theta_2 = \theta + u$.

These moves represent one-to-one transformations of variables in $\mathbb{R}^2$, that is,

$$\left(\frac{\theta_1+\theta_2}{2},\ \frac{\theta_2-\theta_1}{2}\right) = T_{21}(\theta_1,\theta_2), \qquad (\theta-u,\ \theta+u) = T_{12}(\theta,u),$$

with corresponding Jacobians

$$\left|\frac{\partial T_{21}(\theta_1,\theta_2)}{\partial(\theta_1,\theta_2)}\right| = \frac{1}{2}, \qquad \left|\frac{\partial T_{12}(\theta,u)}{\partial(\theta,u)}\right| = 2.$$

The acceptance probability for a move from $(1,\theta)$ to $(2,\theta_1,\theta_2)$ is thus

$$\min\left(\frac{\pi(2,\theta_1,\theta_2)}{\pi(1,\theta)}\,\frac{\pi_{21}}{\pi_{12}\,g(u)}\times 2,\ 1\right),$$

where $u = (\theta_2-\theta_1)/2$.
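
The two moves of Example 11.4 can be sketched in a few lines, together with a numerical check of the detailed balance identity that the acceptance probability is designed to enforce. The target `target`, the auxiliary density `g`, and the move probabilities `PI_12` and `PI_21` below are hypothetical choices made only for illustration; they are not taken from Green (1995).

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def target(k, theta):
    """Hypothetical unnormalized target pi(k, theta), for illustration only."""
    if k == 1:
        return 0.5 * normal_pdf(theta[0])
    return 0.5 * normal_pdf(theta[0], 0.0, 2.0) * normal_pdf(theta[1], 0.0, 2.0)

def g(u):
    """Assumed density of the auxiliary variable u."""
    return normal_pdf(u)

PI_12 = 0.5  # assumed probability of proposing a jump from model 1 to model 2
PI_21 = 0.5  # assumed probability of proposing a jump from model 2 to model 1

def split(theta, u):
    """T_12: (theta, u) -> (theta_1, theta_2), Jacobian 2."""
    return (theta - u, theta + u)

def merge(theta1, theta2):
    """T_21: (theta_1, theta_2) -> (theta, u), Jacobian 1/2."""
    return ((theta1 + theta2) / 2.0, (theta2 - theta1) / 2.0)

def accept_split(theta, u):
    t1, t2 = split(theta, u)
    ratio = target(2, (t1, t2)) / target(1, (theta,)) * PI_21 / (PI_12 * g(u)) * 2.0
    return min(1.0, ratio)

def accept_merge(theta1, theta2):
    theta, u = merge(theta1, theta2)
    ratio = target(1, (theta,)) / target(2, (theta1, theta2)) * PI_12 * g(u) / PI_21 * 0.5
    return min(1.0, ratio)

# Detailed balance at one point, expressed in the (theta, u) coordinates:
# the backward flow carries the Jacobian 2 of T_12.
theta, u = 0.3, 0.7
theta1, theta2 = split(theta, u)
forward = target(1, (theta,)) * PI_12 * g(u) * accept_split(theta, u)
backward = target(2, (theta1, theta2)) * PI_21 * 2.0 * accept_merge(theta1, theta2)
print(forward, backward)
```

The two printed values agree up to rounding error, whatever the (positive) target and auxiliary density, which is the pointwise version of the reversibility argument given above.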



11.2.2 A Fixed Dimension Reassessment 

While the above development is completely valid from a mathematical point of
view, we now redefine Green’s algorithm via a saturation scheme that provides
better intuition for the determination of the acceptance probability. When
considering a specific move from $\mathcal{M}_m$ to $\mathcal{M}_n$, that is, from $\theta_m \in \Theta_m$ to
$\theta_n \in \Theta_n$, where $d_m = \dim\Theta_m < \dim\Theta_n = d_n$, we can indeed describe an
equivalent move in a fixed dimension setting.

As described above, the central feature of Green’s algorithm is to add
an auxiliary variable $u_{mn} \in \mathfrak{U}_{mn}$ to $\theta_m$ so that $\Theta_m\times\mathfrak{U}_{mn}$ and $\Theta_n$ are in
bijection (one-to-one) relation. (We consider the special case where $\theta_n$ need
not be completed.) Using a regular Metropolis-Hastings scheme, to propose
a move from the pair $(\theta_m,u_{mn})$ to $\theta_n$ is like proposing to do so when the
corresponding stationary distributions are $\pi(m,\theta_m)\varphi_{mn}(u_{mn})$ and $\pi(n,\theta_n)$,
respectively, and when the proposal distribution is deterministic, since

$$\theta_n = T_{mn}(\theta_m,u_{mn}).$$

This is an unusual setting for Metropolis-Hastings moves, because of its
deterministic feature, but it can be solved by the following approximation: Consider
that the move from $(\theta_m,u_{mn})$ to $\theta_n$ proceeds by generating

$$\theta_n \sim \mathcal{N}_{d_n}\left(T_{mn}(\theta_m,u_{mn}),\ \varepsilon I\right), \qquad \varepsilon > 0,$$

and that the reciprocal proposal is to take $(\theta_m,u_{mn})$ as the $T_{mn}$-inverse
transform of a normal $\mathcal{N}_{d_n}(\theta_n,\varepsilon I)$. (This is feasible since $T_{mn}$ is a bijection.) This
reciprocal proposal then has the density

$$\frac{\exp\left\{-\|\theta_n - T_{mn}(\theta_m,u_{mn})\|^2/2\varepsilon\right\}}{(2\pi\varepsilon)^{d_n/2}}\left|\frac{\partial T_{mn}(\theta_m,u_{mn})}{\partial(\theta_m,u_{mn})}\right|$$

by the Jacobian rule. Therefore, the Metropolis-Hastings acceptance ratio for
this regular move is

$$1\wedge\left(\frac{\pi(n,\theta_n)}{\pi(m,\theta_m)\,\varphi_{mn}(u_{mn})}\left|\frac{\partial T_{mn}(\theta_m,u_{mn})}{\partial(\theta_m,u_{mn})}\right|\,\frac{\exp\left\{-\|\theta_n - T_{mn}(\theta_m,u_{mn})\|^2/2\varepsilon\right\}/(2\pi\varepsilon)^{d_n/2}}{\exp\left\{-\|\theta_n - T_{mn}(\theta_m,u_{mn})\|^2/2\varepsilon\right\}/(2\pi\varepsilon)^{d_n/2}}\right),$$

and the normal densities cancel as in a regular random walk proposal. If we
take into account the probabilities of the moves between $\mathcal{M}_m$ and $\mathcal{M}_n$, we
thus end up with



$$1\wedge\left(\frac{\pi(n,\theta_n)\,\pi_{nm}}{\pi(m,\theta_m)\,\varphi_{mn}(u_{mn})\,\pi_{mn}}\left|\frac{\partial T_{mn}(\theta_m,u_{mn})}{\partial(\theta_m,u_{mn})}\right|\right).$$

Since this probability does not depend on $\varepsilon$, we can let $\varepsilon$ go to zero and
obtain the equivalent of the ratio (11.4). The reversible jump algorithm can
thus be reinterpreted as a sequence of local fixed dimensional moves between
the models $\mathcal{M}_k$ (Problem 11.5).



11.2.3 The Practice of Reversible Jump MCMC 

The dimension matching transform, $T_{mn}$, while incredibly flexible, can be
quite difficult to create and much more difficult to optimize; one could almost
say this universality in the choice of $T_{mn}$ is a drawback of the method. In
fact, the total freedom left by the reversible jump principle about the choice of
the jumps, which are often referred to as split and merge moves in embedded
models, creates a potential opening for inefficiency and requires tuning steps
which may be quite demanding. As also mentioned in a later chapter, this is
a setting where wealth is a mixed blessing, if only because the total lack of
direction in the choice of the jumps may result in a lengthy or even impossible
calibration of the algorithm.

Example 11.5. Linear versus quadratic regression. Instead of choosing
a particular regression model, we can use the reversible jump algorithm to do
model averaging.

Suppose that the two candidate models are the linear and quadratic
regression models; that is,

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \qquad\text{and}\qquad y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i.$$

If we represent either regression by $y = X\beta + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0,\sigma^2 I)$, the
least squares estimate $\hat\beta = (X'X)^{-1}X'y$ has distribution

$$\mathcal{N}\left(\beta,\ \sigma^2(X'X)^{-1}\right).$$

Using normal prior distributions will result in normal posterior distributions,
and the reversible jump algorithm will then be jumping between a
two-dimensional and a three-dimensional normal distribution.

To jump between these models, it seems sensible to first transform to
orthogonal coordinates, as a jump that is made by simply adding or deleting
a coefficient will then not affect the fit of the other coefficients. We thus find an
orthogonal matrix $P$ and diagonal matrix $D_\lambda$ satisfying

$$P'(X'X)P = D_\lambda.$$

The elements of $D_\lambda$, $\lambda_i$, are the eigenvalues of $X'X$ and the columns of $P$ are
its eigenvectors. We then write $X^* = XP$ and $\alpha = P'\beta$, and we work with
the model $y = X^*\alpha + \varepsilon$.

If each ai has a normal prior distribution, ai ^ its posterior 

density, denoted by fi, is M {hi, bier‘s), where hi = • The possible moves 

are as follows: 

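(i) linear → linear: $(\alpha_0,\alpha_1) \to (\alpha_0',\alpha_1')$, where $(\alpha_0',\alpha_1') \sim f_0 f_1$,

(ii) linear → quadratic: $(\alpha_0,\alpha_1) \to (\alpha_0,\alpha_1,\alpha_2')$, where $\alpha_2' \sim f_2$,

(iii) quadratic → quadratic: $(\alpha_0,\alpha_1,\alpha_2) \to (\alpha_0',\alpha_1',\alpha_2')$, where $(\alpha_0',\alpha_1',\alpha_2') \sim f_0 f_1 f_2$,

(iv) quadratic → linear: $(\alpha_0,\alpha_1,\alpha_2) \to (\alpha_0',\alpha_1')$, where $(\alpha_0',\alpha_1') = (\alpha_0,\alpha_1)$.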

The algorithm was implemented on simulated data with move probabilities 
TTij all taken to be 1/4 and a prior probability of 1/2 on each regression model. 
The resulting fits are given in Figure 11.2. It is interesting to note that when 
the model is quadratic, the reversible jump fit is close to that of quadratic 
least squares, but it deviates from quadratic least squares when the underlying 
model is linear. (See Problem 11.3 for more details.) || 



Fig. 11.2. Comparison of quadratic least squares (dashed line) and reversible jump
(solid line) fits. In the left panel the underlying model is quadratic, and in the right
panel the underlying model is linear.

Example 11.6. Piecewise constant densities. Consider a density $f$ on
$[0,1]$ of the form

$$f(x) = \sum_{i=1}^{k}\omega_i\,\mathbb{I}_{[a_i,a_{i+1})}(x),$$

with

$$a_1 = 0, \qquad a_{k+1} = 1, \qquad\text{and}\qquad \sum_{i=1}^{k}\omega_i(a_{i+1}-a_i) = 1.$$

This model corresponds, for instance, to a special kind of nonparametric density
estimation, where the estimator is a step function.

Assuming all parameters unknown (including $k$), define $p_i = \omega_i(a_{i+1}-a_i)$,
so these are probabilities. Let the prior distribution be

$$\pi(k,p^{(k)},a^{(k)}) = \frac{\lambda^k e^{-\lambda}}{k!}\,\frac{\Gamma(k/2)}{\Gamma(1/2)^k}\,p_1^{-1/2}\cdots p_k^{-1/2}\,(k-1)!\ \mathbb{I}_{a_2\le\cdots\le a_k},$$

where $p^{(k)} = (p_1,\ldots,p_k)$ and $a^{(k)} = (a_2,\ldots,a_k)$, which implies a Poisson
distribution on $k$, $\mathcal{P}(\lambda)$, a uniform distribution on $\{a_2,\ldots,a_k\}$, and a Dirichlet
distribution $\mathcal{D}_k(1/2,\ldots,1/2)$ on the weights $p_i$ of the components $\mathcal{U}_{[a_i,a_{i+1}]}$
of $f$. (Note that the density integrates to 1 over $\Theta$.)

For a sample $x_1,\ldots,x_n$, the posterior distribution is

$$\pi(k,p^{(k)},a^{(k)}|x_1,\ldots,x_n) \propto \frac{\lambda^k}{k}\,\frac{\Gamma(k/2)}{\Gamma(1/2)^k}\,p_1^{n_1-1/2}\cdots p_k^{n_k-1/2}\,\prod_{i=1}^{k}(a_{i+1}-a_i)^{-n_i}\ \mathbb{I}_{a_2\le\cdots\le a_k},$$

where $n_j$ is the number of observations between $a_j$ and $a_{j+1}$. We can, for
instance, restrict the moves to jumps only between neighboring models; that
is, models with one more or one less component in the partition of $[0,1]$. We
then represent the jump from model $\Theta_k$ to model $\Theta_{k-1}$ as a random choice of
$i$ ($1 \le i \le k-1$), and the aggregation of the $i$th and $(i+1)$st components as

$$a_2^{(k-1)} = a_2^{(k)},\ \ldots,\ a_i^{(k-1)} = a_i^{(k)},\quad a_{i+1}^{(k-1)} = a_{i+2}^{(k)},\ \ldots,\ a_{k-1}^{(k-1)} = a_k^{(k)}$$

and

$$p_1^{(k-1)} = p_1^{(k)},\ \ldots,\ p_i^{(k-1)} = p_i^{(k)} + p_{i+1}^{(k)},\ \ldots,\ p_{k-1}^{(k-1)} = p_k^{(k)}.$$

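For reasons of symmetry, the opposite (upward) jump implies choosing a
component $i$ at random and breaking it by the procedure

1. Generate $u_1, u_2 \sim \mathcal{U}_{[0,1]}$;

2. Take $a_i^{(k)} = a_i^{(k-1)}$, $a_{i+1}^{(k)} = u_1 a_i^{(k-1)} + (1-u_1)\,a_{i+1}^{(k-1)}$, and $a_{i+2}^{(k)} = a_{i+1}^{(k-1)}$;

3. Take $p_i^{(k)} = u_2\,p_i^{(k-1)}$ and $p_{i+1}^{(k)} = (1-u_2)\,p_i^{(k-1)}$.

The other quantities remain identical up to a possible index shift. The weight
corresponding to the jump from model $\Theta_k$ to model $\Theta_{k+1}$ is then

$$\min\left\{1,\ \frac{\pi(k+1,p^{(k+1)},a^{(k+1)})}{\pi(k,p^{(k)},a^{(k)})}\left|\frac{\partial(p^{(k+1)},a^{(k+1)})}{\partial(p^{(k)},a^{(k)},u_1,u_2)}\right|\right\}.$$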



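As an illustration, take $k = 3$ and consider the jump from model $\Theta_3$ to
model $\Theta_4$. The transformation is given by

$$\begin{pmatrix} p_1^{(3)}\\ p_2^{(3)}\\ p_3^{(3)}\\ a_2^{(3)}\\ a_3^{(3)}\\ u_1\\ u_2\end{pmatrix}
\longmapsto
\begin{pmatrix} p_1^{(3)}\\ u_2\,p_2^{(3)}\\ (1-u_2)\,p_2^{(3)}\\ p_3^{(3)}\\ a_2^{(3)}\\ u_1 a_2^{(3)} + (1-u_1)\,a_3^{(3)}\\ a_3^{(3)}\end{pmatrix}
=
\begin{pmatrix} p_1^{(4)}\\ p_2^{(4)}\\ p_3^{(4)}\\ p_4^{(4)}\\ a_2^{(4)}\\ a_3^{(4)}\\ a_4^{(4)}\end{pmatrix}$$

with Jacobian (rows indexed by the components of $(p^{(4)},a^{(4)})$, columns by
$p_1^{(3)}, p_2^{(3)}, p_3^{(3)}, a_2^{(3)}, a_3^{(3)}, u_1, u_2$)

$$\left|\frac{\partial(p^{(4)},a^{(4)})}{\partial(p^{(3)},a^{(3)},u_1,u_2)}\right| =
\left|\det\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & u_2 & 0 & 0 & 0 & 0 & p_2^{(3)}\\
0 & 1-u_2 & 0 & 0 & 0 & 0 & -p_2^{(3)}\\
0 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & u_1 & 1-u_1 & a_2^{(3)}-a_3^{(3)} & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}\right| = p_2^{(3)}\left(a_3^{(3)}-a_2^{(3)}\right).$$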
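
The split and merge moves of Example 11.6 can be sketched as follows. This is a minimal illustration under the assumption that the weights and breakpoints are stored in plain lists with 0-based indices (so that `a[0] = 0` and `a[k] = 1`); the function names are hypothetical. The Jacobian returned by the split is the quantity $p_i(a_{i+1}-a_i)$ obtained in the illustration above.

```python
def split_component(p, a, i, u1, u2):
    """Split component i (0-based) of a piecewise constant density.

    p: list of k weights summing to one.
    a: list of k + 1 breakpoints, with a[0] = 0 and a[k] = 1.
    Returns the new weights, the new breakpoints, and the absolute Jacobian.
    """
    new_break = u1 * a[i] + (1.0 - u1) * a[i + 1]
    new_a = a[:i + 1] + [new_break] + a[i + 1:]
    new_p = p[:i] + [u2 * p[i], (1.0 - u2) * p[i]] + p[i + 1:]
    jacobian = p[i] * (a[i + 1] - a[i])
    return new_p, new_a, jacobian

def merge_components(p, a, i):
    """Merge components i and i + 1 (0-based); inverse of split_component."""
    new_p = p[:i] + [p[i] + p[i + 1]] + p[i + 2:]
    new_a = a[:i + 1] + a[i + 2:]
    return new_p, new_a

# Jump from k = 3 to k = 4 by splitting the second component (index 1).
p3 = [0.2, 0.5, 0.3]
a3 = [0.0, 0.3, 0.7, 1.0]
p4, a4, jac = split_component(p3, a3, 1, u1=0.25, u2=0.4)
p_back, a_back = merge_components(p4, a4, 1)
print(p4, a4, jac)
```

Merging the two newly created components returns the original configuration, which is the bijection property required by the dimension matching condition.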



(See Problem 11.6 for extensions.)

These two examples are fairly straightforward, and there are much more
realistic illustrations in the literature. We will see two of these below for
mixtures of distributions and AR(p) models, respectively. See Green (1995),
Denison et al. (1998), and Holmes et al. (2002) for other illustrations. In particular,
Cappé et al. (2003, 2004) provide in great detail the implementation of the
reversible jump algorithm for a hidden Markov model (see Section 14.3.2).

Example 11.7 is quite popular in the reversible jump literature, in that
Richardson and Green (1997) dealt with a realistic mixture model, providing
sensible answers with seemingly straightforward proposals.



Example 11.7. (Continuation of Example 11.1) If we consider a model
$\mathcal{M}_k$ to be the $k$-component normal mixture distribution in (11.1), moves
between models involve changing the number of components in the mixture
and thus adding new components or removing older components.

A first possibility considered in Richardson and Green (1997) is to restrict
the moves to models with one more or one less component, that is, from $\mathcal{M}_k$
to $\mathcal{M}_{k+1}$ or $\mathcal{M}_{k-1}$, and to create a reversible birth-and-death process using
the prior on the components (assumed to be independent) as a proposal for
the birth step.

The birth step associated with the move from $\mathcal{M}_k$ to $\mathcal{M}_{k+1}$ consists in
adding a new normal component in the mixture. The new $(k+1)$-component
normal mixture distribution is then made of the previous $k$-component normal
mixture distribution and a new component, with weight $p_{(k+1)(k+1)}$, and
mean and variance, $\mu_{(k+1)(k+1)}$ and $\sigma^2_{(k+1)(k+1)}$. These parameters can be
generated in many ways, but a natural solution is to simulate $p_{(k+1)(k+1)}$ from the
marginal distribution of $(p_{1(k+1)},\ldots,p_{(k+1)(k+1)})$ (which typically is a Dirichlet
$\mathcal{D}(\alpha_1,\ldots,\alpha_{k+1})$) and then simulate $\mu_{(k+1)(k+1)}$ and $\sigma^2_{(k+1)(k+1)}$ from the
corresponding prior distributions. Obviously, constraining the weights to sum
to one implies that the weights of the $k$ component mixture, $p_{1(k)},\ldots,p_{k(k)}$,
have to be multiplied by $(1-p_{(k+1)(k+1)})$ to obtain the weights of the $(k+1)$
mixture, $p_{1(k+1)},\ldots,p_{k(k+1)}$. The parameters of the additional component are
denoted $u_{k(k+1)}$ with corresponding proposal distribution $\varphi_{k(k+1)}(u_{k(k+1)})$.

It follows from the reversibility constraint that the death step is
necessarily the (deterministic) opposite. We remove one of the $k$ components and
renormalize the other weights. In the case of the Dirichlet distribution, the
corresponding acceptance probability for a birth step is then, following (11.4),

$$\min\left(\frac{\pi_{(k+1)k}}{\pi_{k(k+1)}}\,\frac{(k+1)!}{k!}\,\frac{\pi_{k+1}(\theta_{k+1})}{\pi_k(\theta_k)\,(k+1)\,\varphi_{k(k+1)}(u_{k(k+1)})},\ 1\right)$$

$$= \min\left(\frac{\pi_{(k+1)k}}{\pi_{k(k+1)}}\,\frac{\varrho(k+1)}{\varrho(k)}\,\frac{\ell_{k+1}(\theta_{k+1})\,(1-p_{k+1})^{k-1}}{\ell_k(\theta_k)},\ 1\right), \tag{11.5}$$

where $\ell_k$ denotes the likelihood of the $k$ component mixture model $\mathcal{M}_k$, if
we use the exact prior distributions (Problem 11.11). Note that the factorials
$k!$ and $(k+1)!$ in the above probability appear as the numbers of ways of
ordering the $k$ and $k+1$ components of the mixtures (Problem 11.12): the
ratio cancels with $1/(k+1)$, which is the probability of selecting a particular
component for the death step. If $\pi_{(k+1)k} = \pi_{k(k+1)}$ and $\varrho(k) = \lambda^k/k!$, the
birth acceptance ratio thus simplifies into

$$\min\left(\frac{\lambda\,(1-p_{k+1})^{k-1}}{k+1}\,\frac{\ell_{k+1}(\theta_{k+1})}{\ell_k(\theta_k)},\ 1\right).$$
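
As a minimal sketch of the simplified birth acceptance ratio above, the following code evaluates it on the log scale for a normal mixture. It assumes, as in the text, that $\pi_{(k+1)k} = \pi_{k(k+1)}$, that $\varrho(k) = \lambda^k/k!$, and that the new component is generated from the exact priors; the function names and the data values are hypothetical.

```python
import math

def mixture_loglik(data, weights, means, variances):
    """Log-likelihood of a normal mixture with the given components."""
    total = 0.0
    for x in data:
        dens = 0.0
        for p, m, v in zip(weights, means, variances):
            dens += p * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        total += math.log(dens)
    return total

def birth_acceptance(data, weights, means, variances, p_new, mu_new, var_new, lam):
    """Return (acceptance probability, log ratio) for adding one component."""
    k = len(weights)
    old_ll = mixture_loglik(data, weights, means, variances)
    # Existing weights are rescaled by (1 - p_new) so that all weights sum to one.
    new_weights = [p * (1.0 - p_new) for p in weights] + [p_new]
    new_ll = mixture_loglik(data, new_weights, means + [mu_new], variances + [var_new])
    log_ratio = (math.log(lam) - math.log(k + 1)
                 + (k - 1) * math.log(1.0 - p_new)
                 + new_ll - old_ll)
    return math.exp(min(0.0, log_ratio)), log_ratio

# Made-up data and a current two-component mixture, for illustration only.
data = [9.2, 9.8, 10.1, 19.5, 20.3, 21.0, 22.4]
prob, log_ratio = birth_acceptance(
    data, [0.4, 0.6], [10.0, 21.0], [1.0, 2.0],
    p_new=0.1, mu_new=15.0, var_new=4.0, lam=3.0)
print(prob, log_ratio)
```

When the proposed weight `p_new` is close to zero the likelihood is almost unchanged and the ratio reduces to $\lambda/(k+1)$, which gives a quick sanity check of an implementation.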



While this proposal can work well in some settings, as in Richardson and
Green (1997) when the prior is calibrated against the data, it can also be
inefficient, that is, lead to a high rejection rate, if the prior is vague, since
the birth proposals are not tuned properly. A second proposal, central to
the solution of Richardson and Green (1997), is to devise more local jumps
between models.

In this proposal, the upward move from $\mathcal{M}_k$ to $\mathcal{M}_{k+1}$ is called a split
move. In essence, it replaces a component, say the $j$th, with two components
centered at this earlier component. The split parameters are thus created
under a moment condition

$$\begin{aligned}
p_{jk} &= p_{j(k+1)} + p_{(j+1)(k+1)},\\
p_{jk}\,\mu_{jk} &= p_{j(k+1)}\,\mu_{j(k+1)} + p_{(j+1)(k+1)}\,\mu_{(j+1)(k+1)},\\
p_{jk}\,\sigma^2_{jk} &= p_{j(k+1)}\,\sigma^2_{j(k+1)} + p_{(j+1)(k+1)}\,\sigma^2_{(j+1)(k+1)}.
\end{aligned} \tag{11.6}$$

The downward move, called a merge, is obtained directly as (11.6) by the
reversibility constraint.

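The split move satisfying (11.6) can, for instance, be obtained by generating
the auxiliary variable $u_{k(k+1)}$ as $u_1, u_2, u_3 \sim \mathcal{U}(0,1)$, and then taking

$$\begin{aligned}
p_{j(k+1)} &= u_1\,p_{jk}, & p_{(j+1)(k+1)} &= (1-u_1)\,p_{jk},\\
\mu_{j(k+1)} &= u_2\,\mu_{jk}, & \mu_{(j+1)(k+1)} &= \frac{p_{jk}-p_{j(k+1)}u_2}{p_{jk}-p_{j(k+1)}}\,\mu_{jk},\\
\sigma^2_{j(k+1)} &= u_3\,\sigma^2_{jk}, & \sigma^2_{(j+1)(k+1)} &= \frac{p_{jk}-p_{j(k+1)}u_3}{p_{jk}-p_{j(k+1)}}\,\sigma^2_{jk}.
\end{aligned}$$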




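The Jacobian of the split transform is thus

$$\begin{pmatrix}
u_1 & 1-u_1 & \cdots & \cdots & \cdots & \cdots\\
p_{jk} & -p_{jk} & \cdots & \cdots & \cdots & \cdots\\
0 & 0 & u_2 & \dfrac{p_{jk}-p_{j(k+1)}u_2}{p_{jk}-p_{j(k+1)}} & 0 & 0\\
0 & 0 & \mu_{jk} & \dfrac{-p_{j(k+1)}}{p_{jk}-p_{j(k+1)}}\,\mu_{jk} & 0 & 0\\
0 & 0 & 0 & 0 & u_3 & \dfrac{p_{jk}-p_{j(k+1)}u_3}{p_{jk}-p_{j(k+1)}}\\
0 & 0 & 0 & 0 & \sigma^2_{jk} & \dfrac{-p_{j(k+1)}}{p_{jk}-p_{j(k+1)}}\,\sigma^2_{jk}
\end{pmatrix}$$

with a block diagonal structure that does not require the upper part of the
derivatives to be computed. The absolute value of the determinant of this
matrix is then

$$p_{jk}\times|\mu_{jk}|\,\frac{p_{jk}}{p_{jk}-p_{j(k+1)}}\times\sigma^2_{jk}\,\frac{p_{jk}}{p_{jk}-p_{j(k+1)}} = \frac{p_{jk}\,|\mu_{jk}|\,\sigma^2_{jk}}{(1-u_1)^2},$$

and the corresponding acceptance probability is

$$\min\left(\frac{\pi_{(k+1)k}}{\pi_{k(k+1)}}\,\frac{\varrho(k+1)}{\varrho(k)}\,\frac{\pi_{k+1}(\theta_{k+1})\,\ell_{k+1}(\theta_{k+1})}{\pi_k(\theta_k)\,\ell_k(\theta_k)}\,\frac{p_{jk}\,|\mu_{jk}|\,\sigma^2_{jk}}{(1-u_1)^2},\ 1\right).$$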

This is one possible jump proposal following Richardson and Green (1997), 
and others could be implemented as well with a similar or higher efficiency. 
Note also that the implementation of Richardson and Green (1997) is slightly 
different from the one presented above in that, for them, only adjacent com- 
ponents can be merged together. (Adjacency is defined in terms of the order 
on the means of the components.) There is no theoretical reason for doing so, 
but the authors consider that merging only adjacent components may result 
in a higher efficiency (always in terms of acceptance probability): merging
components that are far apart is less likely to be accepted, even though this 
may create a component with a larger variance that could better model the 
tails of the sample. We stress that, when this constraint is adopted, reversibil- 
ity implies that the split move must be restricted accordingly: only adjacent 
components can be created at the split stage. 

The implementation of Richardson and Green (1997) also illustrates an- 
other possibility with the reversible jump algorithm, namely, the possibility 
of including fixed dimensional moves in addition to the variable dimensional 
moves. This hybrid structure is completely valid, with a justification akin to 
the Gibbs sampler, that is, as a composition of several MCMC steps. Richardson
and Green (1997) use fixed dimension moves to update the hyperparameters
of their model, as well as the missing variables associated with the mixture.
(As shown, for instance, in Cappé et al., 2003, this completion step is
not necessary for the algorithm to work.)

Figures 11.3-11.5 illustrate the implementation of this algorithm for the
Galaxy dataset presented in Figure 11.1, also used by Richardson and Green
(1997). In Figure 11.3, the MCMC output on the number of components $k$ is
represented as a histogram on $k$, and the corresponding sequence of $k$’s. The
prior used on $k$ is a uniform distribution on $\{1,\ldots,20\}$: as shown by the lower
plot, most values of $k$ are explored by the reversible jump algorithm, but the
upper bound does not appear to be unduly restrictive since the $k^{(t)}$’s hardly
ever reach this upper limit.

Figure 11.4 illustrates one appealing feature of the MCMC experiment,
namely that conditioning the output on the most likely value of $k$ (the
posterior mode equal to 3 here) is possible. The nine graphs in this figure show
the joint variation of the three types of parameters, as well as the stability of
the Markov chain over the 1,000,000 iterations; the cumulative averages are
quite stable, almost from the start.

Fig. 11.3. Histogram and raw plot of 100,000 $k$’s produced by a reversible jump
MCMC algorithm for the Galaxy dataset of Figure 11.1.

The density plotted on top of the histogram in Figure 11.5 is another
good illustration of the inferential possibilities offered by reversible jump
algorithms, as a case of model averaging: this density is obtained as the average
over iterations $t$ of

$$\sum_{j=1}^{k^{(t)}} p^{(t)}_{jk^{(t)}}\,\mathcal{N}\left(\mu^{(t)}_{jk^{(t)}},\ (\sigma^{(t)}_{jk^{(t)}})^2\right),$$

which approximates the posterior expectation $\mathbb{E}[f(y|\theta)|x]$, where $x$ denotes
the data $x_1,\ldots,x_{82}$. (This can also be seen as a Bayesian equivalent of a
kernel estimator.) ||
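
A model-averaged density of this kind can be sketched as follows, assuming the reversible jump output is stored as a list of draws, each draw being a triple of lists (weights, means, variances) whose common length $k^{(t)}$ may change from one iteration to the next. The storage format, the function names, and the two draws below are hypothetical.

```python
import math

def normal_pdf(y, mean, variance):
    return math.exp(-0.5 * (y - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)

def averaged_density(y, draws):
    """Average over iterations of the mixture density evaluated at y."""
    total = 0.0
    for weights, means, variances in draws:
        total += sum(p * normal_pdf(y, m, v)
                     for p, m, v in zip(weights, means, variances))
    return total / len(draws)

# Two made-up draws with different numbers of components (k = 1 and k = 2).
draws = [
    ([1.0], [20.0], [4.0]),
    ([0.3, 0.7], [10.0, 21.0], [1.0, 4.0]),
]

# Riemann sum of the averaged density over a wide grid, as a sanity check.
step = 0.01
mass = sum(averaged_density(i * step, draws) for i in range(-2000, 6001)) * step
print(averaged_density(20.0, draws), mass)
```

Because each draw is itself a proper density, the average is a proper density regardless of how the number of components varies across iterations.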



Example 11.8. (Continuation of Example 11.3) An AR(p) model

$$X_t = \sum_{i=1}^{p}\theta_{ip}X_{t-i} + \sigma_p\varepsilon_t \tag{11.7}$$

is often restricted by a stationarity condition on the process $(X_t)$ (see, e.g.,
Robert, 2001, Section 4.5). While this constraint can be expressed either on
the coefficients $\theta_{ip}$ of the model or on the partial auto-correlations, a
convenient (from a prior point of view) representation is to use the lag-polynomial
associated with the model $\mathcal{M}_p$,

$$\prod_{i=1}^{p}(1-\lambda_i B)\,X_t = \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0,\sigma^2),$$

where $B$ denotes the lag operator ($BX_t = X_{t-1}$), and to constrain the inverse
roots, $\lambda_i$, to stay within the unit circle if complex
and within $[-1,1]$ if real (see Barnett et al. 1996, Huerta and West 1999, and
Robert, 2001, Section 4.5.2). A natural prior in this setting is then to use
uniform priors for the real and complex roots $\lambda_i$,³ that is,

$$\pi_p(\lambda) = \frac{1}{\lfloor p/2\rfloor+1}\prod_{\lambda_i\in\mathbb{R}}\frac{1}{2}\,\mathbb{I}_{|\lambda_i|<1}\prod_{\lambda_i\notin\mathbb{R}}\frac{1}{\pi}\,\mathbb{I}_{|\lambda_i|<1},$$

where $\lfloor p/2\rfloor+1$ is the number of different values of $r_p$, the number of real roots.

³ In order to simplify notation, we have refrained from using a double index on the
$\lambda_i$’s, which should be, strictly speaking, the $\lambda_{ip}$’s.

Fig. 11.4. Reversible jump MCMC output on the parameters of the model $\mathcal{M}_3$, for
the Galaxy dataset, obtained by conditioning on $k = 3$. The left column gives the
histogram of the weights, means, and variances; the middle column the scatterplot of
the pairs weights-means, means-variances, and variances-weights; the right column
plots the cumulated averages (over iterations) for the weights, means, and variances.

Note that this factor, $1/(\lfloor p/2\rfloor+1)$, while unimportant for a fixed $p$ setting, must be
included within the posterior distribution when using reversible jump since it
does not vanish in the acceptance probability of a move between models $\mathcal{M}_p$
and $\mathcal{M}_q$. If it is omitted in the acceptance probability of the reversible jump
move, this results in a modification of the prior probability of each model
$\mathcal{M}_p$, from $\varrho(p)$ to $\varrho(p)(\lfloor p/2\rfloor+1)$. See also Vermaak et al. (2003) for a similar
phenomenon.

Fig. 11.5. Fit of the dataset by the averaged density, $\mathbb{E}[f(y|\theta)|x]$.

The most basic choice for a reversible jump algorithm in this framework is
to use the uniform priors as proposals for a birth-and-death scheme where the
birth moves are either from $\mathcal{M}_p$ to $\mathcal{M}_{p+1}$, for the creation of a real root $\lambda_{p+1}$,
or from $\mathcal{M}_p$ to $\mathcal{M}_{p+2}$, for the creation of two conjugate complex roots $\lambda_{p+1}$
and $\bar\lambda_{p+1}$. The corresponding death moves are then completely determined:
from $\mathcal{M}_{p+1}$ to $\mathcal{M}_p$ with the deletion of a real root (if any) and from $\mathcal{M}_{p+2}$
to $\mathcal{M}_p$ with the deletion of two conjugate complex roots (if any).

As in the birth-and-death proposal for Example 11.7, the acceptance
probability simplifies quite dramatically since it is, for example,

$$\min\left(\frac{\pi_{(p+1)p}}{\pi_{p(p+1)}}\,\frac{(r_p+1)!}{r_p!}\,\frac{\lfloor p/2\rfloor+1}{\lfloor (p+1)/2\rfloor+1}\,\frac{\ell_{p+1}(\theta_{p+1})}{\ell_p(\theta_p)},\ 1\right)$$

in the case of a move from $\mathcal{M}_p$ to $\mathcal{M}_{p+1}$. (As for the above mixture example,
the factorials are related to the possible choices of the created and the deleted
roots.)

Figure 11.6 presents some views of the corresponding reversible jump
MCMC algorithm. Besides the ability of the algorithm to explore a range
of values of $k$, it also shows that Bayesian inference using these tools is much
richer, since it can, for instance, condition on or average over the order $k$, mix
the parameters of different models and run various tests of these parameters.
A last remark on this graph is that both the order and the value of the
parameters are well estimated, with a characteristic trimodality on the histograms
of the $\theta_i$'s, even when conditioning on $k$ different from 3, the value used for
the simulation. We refer the reader to Vermaak et al. (2003) and Ehlers and
Brooks (2003) for different analyses of this problem using either the partial
autocorrelation representation (and the Durbin-Levinson recursive algorithm)
or the usual AR(p) parameters without stationarity constraint. ||

For any given variable dimension model, there exist an infinity of possible 
reversible jump algorithms. Compared with the fixed dimension case of, for 
instance, Chapter 7, there is more freedom and less structure, because the
between-model moves cannot rely on a Euclidean structure common to both
models (unless they are embedded). Brooks et al. (2003b) tried to provide 
Fig. 11.6. Output of a reversible jump algorithm based on an AR(3) simulated
dataset of 530 points (upper left) with true parameters $\theta_i$ (−0.1, 0.3, −0.4) and $\sigma = 1$.
The first histogram is associated with $k$, the following histograms are associated with
the $\theta_i$'s, for different values of $k$, and of $\sigma^2$. The final graph is a scatterplot of the
complex roots (for iterations where there were complex root(s)). The penultimate
graph plots the evolution over the iterations of $\theta_1, \theta_2, \theta_3$. (Source: Robert 2004.)



general strategies for the construction of efficient reversible jump algorithms 
by setting up rules to calibrate these jumps. While their paper opens too 
many new perspectives to be entirely discussed here, let us mention a scaling 
proposal that relates to the calibrating issue of Section 7.6. 

Brooks et al. (2003b) assume that a transform $T_{mn}$ that moves the
completion of $\mathcal{M}_m$ into the completion of $\mathcal{M}_n$ has been chosen, as well as a
centering function $c_{mn}(\theta_m)$, which is a special value of $\theta_n$ corresponding to $\theta_m$.
For instance, in the setup of Example 11.1, if $\theta_p = (\theta_1, \ldots, \theta_p, \sigma)$, a possible
centering function is

$$c_{p(p+1)}(\theta_p) = (\theta_1, \ldots, \theta_p, 0, \sigma) .$$

Once both $T_{mn}$ and $c_{mn}$ have been chosen, the calibration of the remaining
parameters of the proposal transformation, that is, of the parameters of the
distribution $\varphi_{mn}$ of $u_{mn}$ such that $(\theta_n, u_{nm}) = T_{mn}(\theta_m, u_{mn})$, is based on
the constraint that the probability (11.4) of moving from $\theta_m$ to $c_{mn}(\theta_m)$ is
equal to one. Again, in the setup of Example 11.1, if $\theta_p$ is completed by
$u_{p(p+1)} = \theta_{p+1} \sim \mathcal{N}(0, \tau^2)$, the scale $\tau$ of this completion can be determined
by this constraint which, in Brooks et al.'s (2003b) case, becomes

$$\tau^2 = \sigma_a^2 \left( \frac{\pi_{p(p+1)}}{\pi_{(p+1)p}} \right)^2 ,$$

where $\sigma_a^2$ is the variance of the prior distribution on the $\theta_i$'s.

This calibration scheme can be generalized to other settings by imposing
further conditions on the acceptance probability for moving from $\theta_m$ to
$c_{mn}(\theta_m)$, as detailed in Brooks et al. (2003b). Obviously, it cannot work in
all settings, because this acceptance probability often depends on the specific
value of $\theta_m$, and it does not necessarily provide sensible results (as shown
for instance in Robert 2004). The second difficulty with this approach is that
it is completely dependent on the choice of the transform and the centering
function. Brooks et al. (2003b) suggest choosing $c_{mn}$ so that

$$L_n(c_{mn}(\theta_m)) = L_m(\theta_m) ,$$

but this equation does not always have a solution, nor can it always be solved explicitly.



11.3 Alternatives to Reversible Jump MCMC 

Since it first appeared, reversible jump MCMC has had a vast impact on vari- 
able dimension Bayesian inference, especially model choice, but, as mentioned 
earlier, there also exist other approaches to the problem of variable dimension 
models. 

11.3.1 Saturation 

First, Brooks et al. (2003b) reassess the reversible jump methodology through
a global saturation scheme already used in Section 11.2.2. More precisely,
they consider a series of models $\mathcal{M}_k$ with corresponding parameter spaces $\Theta_k$
($k = 1, \ldots, K$) such that $\max_k \dim(\Theta_k) = n_{\max} < \infty$. For each model $\mathcal{M}_k$,
the parameter $\theta_k$ is then completed with an auxiliary variable $u_k \sim q_k(u_k)$
such that all $(\theta_k, u_k)$'s are one-to-one transforms of one another. The authors
define, in addition, rv's $\omega_k$ independently distributed from $\psi$ that are used
to move between models. For instance, in the setting of Example 11.8, the
saturated proposal on $\mathcal{M}_k$ may be

(11.8) $(\sigma_k + \omega_0, \theta_1 + \omega_1, \ldots, \theta_k + \omega_k, u_k) , \qquad u_k \in \mathbb{R}^{k_{\max}-k} ,$

if there is no stationarity constraint imposed on the model.

Brooks et al. (2003b) then assign the following joint auxiliary prior
distribution to a parameter in $\Theta_k$,

$$\pi(k, \theta_k) \, q_k(u_k) \prod_{i=1}^{n_{\max}} \psi(\omega_i) .$$

Within this augmented (or saturated) framework, there is no varying
dimension anymore since, for all models, the entire vector $(\theta_k, u_k, \omega)$ is of fixed
dimension. Therefore, moves between models can be defined just as freely as
moves between points of each model. See also Besag (2000) and Godsill (2003) 
for a similar development, which differs from the full saturation scheme of 
Carlin and Chib (1995) (Problem 11.9). 

Brooks et al. (2003b) propose a three-stage MCMC update as follows. 

Algorithm A.49 —Saturation Algorithm—

At iteration t

a. Update the current value of the parameter, $\theta_k$, within model $\mathcal{M}_k$

b. Update $u_k$ and $\omega$ conditional on $\theta_k$

c. Update the model from $\mathcal{M}_k$ into $\mathcal{M}_{k'}$ using the bijection

Note that, for specific models, saturation schemes appear rather naturally.
For instance, Green (1995) considered a time series model with change-points,
$y_t \sim p(y_t - \theta_t)$, where $\theta_t$ changes values according to a geometric jump scheme
(see Problems 11.15, 11.17 and 11.18). Whatever the number of change-points
in the series, a reparameterization of the model by the missing data, composed
of the indicators $x_t$ of whether or not a change occurs at time $t$, creates a
fixed dimension model. (See Cappé et al., 2003, for a similar representation
of the semi-Markov jump process introduced in Note 14.6.3.)

Example 11.9. (Continuation of Example 11.8) Within model $\mathcal{M}_k$,
for the AR(k) parameters $\theta_1, \ldots, \theta_k$ and $\sigma_k$, the move in step a can be any
Metropolis-Hastings step that preserves the posterior distribution
(conditional on $k$), like a random walk proposal on the real and complex roots of the
lag-polynomial. The remaining $u_k$ can then be updated via an AR(1) move
$u_k' = \lambda u_k + \sqrt{1 - \lambda^2}\, \epsilon$, with $\epsilon \sim \mathcal{N}(0, 1)$, if the proper stationary
distribution $q_k(u_k)$ is a $\mathcal{N}(0, 1)$
distribution. As noted in Brooks et al. (2003b), the $\omega_k$'s are not really
necessary when the $u_k$'s are independent. In the move from $\mathcal{M}_k$ to $\mathcal{M}_{k+1}$, Brooks
et al. (2003b) suggest using (11.8), or a combination of the non-saturated
approach with an auxiliary variable where $\theta_{k+1} = \sigma u_{k,k+1}$. If $u_k$ is updated
symmetrically, that is, if $u_{k,k+1} = \theta_{k+1}/\sigma$ when moving from $\mathcal{M}_{k+1}$ to $\mathcal{M}_k$,
the algorithm provides the chain with some memory of previous values within
each model and should thus facilitate moves between models. ||
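
The AR(1) update of the auxiliary variable mentioned in Example 11.9 can be checked numerically. The following is only a minimal sketch, assuming a scalar $u_k$ and a standard normal stationary distribution; the function names `ar1_update` and `simulate_ar1` and the value `lam = 0.9` are illustrative choices, not taken from Brooks et al. (2003b).

```python
import math
import random


def ar1_update(u, lam, rng):
    """One AR(1) move u' = lam * u + sqrt(1 - lam^2) * eps, with eps ~ N(0, 1).

    If u ~ N(0, 1), then u' ~ N(0, 1) as well, so N(0, 1) is stationary.
    """
    eps = rng.gauss(0.0, 1.0)
    return lam * u + math.sqrt(1.0 - lam ** 2) * eps


def simulate_ar1(n, lam, seed=1):
    """Run n successive AR(1) moves from u = 0 and return the whole path."""
    rng = random.Random(seed)
    u, path = 0.0, []
    for _ in range(n):
        u = ar1_update(u, lam, rng)
        path.append(u)
    return path


path = simulate_ar1(200_000, lam=0.9)
mean = sum(path) / len(path)
var = sum((v - mean) ** 2 for v in path) / len(path)
print(f"empirical mean {mean:.3f}, empirical variance {var:.3f}")
```

The empirical mean and variance should be close to 0 and 1, while successive values remain strongly correlated: this correlation is what gives the chain its "memory" of previous values.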

The motivation of the saturation scheme by Brooks et al. (2003b) is not
a simplification of the reversible jump algorithm, since the acceptance
probabilities remain fundamentally the same in step c of Algorithm [A.49], but
rather a "memory" over the moves that improves between-model moves. (For
instance, the components of the $u_k$'s may be correlated, as shown in Brooks
et al. 2003b.) The possibilities opened by this range of extensions are
quite exciting, but this wealth is a mixed blessing in that the choice of such moves
requires a level of expertise that escapes the layman.



11.3.2 Continuous-Time Jump Processes 

Reversible jump algorithms operate in discrete time, but similar algorithms
may be formulated in continuous time. This has a long history (for MCMC),
dating back to Preston (1976), Ripley (1987), and Geyer and Møller (1994). Here
we focus on Stephens's (2000) methodology, called birth-and-death MCMC,
which he developed for mixture models.

The idea at the core of this continuous time approach is to build a Markov
jump process to move between models $\mathcal{M}_k$, using births (to increase the
dimension) and deaths (to decrease the dimension), as well as other
dimension-changing moves like the split/combine moves used in Example 11.7 (see Cappé
et al. 2003). The Markov jump process is such that whenever it reaches some
state $\theta$, it stays there for an exponential $\mathcal{E}xp(\lambda(\theta))$ time with intensity $\lambda(\theta)$
depending on $\theta$, and, after expiration of this holding time, jumps to a new
state according to a transition kernel. More precisely, the various moves to
other models are all associated with exponential waiting times depending on
the current state of the Markov chain and the actual move corresponds to the
smallest waiting time (which is still exponentially distributed; see Problem
11.20).
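
The "smallest waiting time" mechanism can be illustrated by simulation. The sketch below is only an illustration under assumed values: the three intensities and the names `next_jump` and `summarize` are hypothetical. It draws one exponential clock per move, keeps the smallest, and compares the result with the two standard facts that the minimum is exponential with rate equal to the sum of the intensities and that move $i$ wins with probability proportional to its intensity.

```python
import random


def next_jump(intensities, rng):
    """Draw one exponential clock per move; return (holding time, winning move)."""
    clocks = [rng.expovariate(lam) for lam in intensities]
    winner = min(range(len(clocks)), key=clocks.__getitem__)
    return clocks[winner], winner


def summarize(intensities, n, seed=0):
    """Average holding time and winning frequencies over n simulated jumps."""
    rng = random.Random(seed)
    total_time, wins = 0.0, [0] * len(intensities)
    for _ in range(n):
        hold, winner = next_jump(intensities, rng)
        total_time += hold
        wins[winner] += 1
    return total_time / n, [w / n for w in wins]


rates = [0.5, 1.5, 3.0]  # illustrative intensities, summing to 5
mean_hold, freqs = summarize(rates, 100_000)
print("mean holding time:", round(mean_hold, 4), "theory:", 1 / sum(rates))
print("winning frequencies:", [round(f, 3) for f in freqs],
      "theory:", [r / sum(rates) for r in rates])
```

In practice one would rather draw a single exponential time with the total intensity and then select the move with probability proportional to its intensity, which is equivalent and cheaper.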

An important feature that distinguishes this algorithm from the previous
MCMC algorithms in discrete time is the jump structure: whenever a jump
occurs, the corresponding move is always accepted. Therefore, what replaces
the acceptance probability of reversible jump methods are the differential
holding times in each state. In particular, implausible configurations, that is,
configurations with small $L_k(\theta_k)\pi(k, \theta_k)$, die quickly.

To ensure that a Markov jump process has an invariant density
proportional to $L_k(\theta_k)\pi(k, \theta_k)$ on $\mathcal{M}_k$, that is, the posterior distribution based on
the prior $\pi(k, \theta_k)$ and the likelihood function $L_k(\theta_k)$, it is sufficient (although
not necessary) that the local detailed balance equations

(11.9) $L_k(\theta_k)\,\pi(k, \theta_k)\,\lambda_{k\ell}(\theta_k, \theta_\ell) = L_\ell(\theta_\ell)\,\pi(\ell, \theta_\ell)\,\lambda_{\ell k}(\theta_\ell, \theta_k)$

hold. Here $\lambda_{k\ell}(\theta_k, \theta_\ell)$ denotes the intensity of moving from state $\theta_k \in \Theta_k$ to
$\theta_\ell \in \Theta_\ell$. (The proof is beyond the scope of this book, but Cappé et al. 2003
provide an intuitive justification based on the approximation of the continuous
time process by a sequence of reversible jump algorithms.)
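Algorithm A.50 — Continuous-Time MCMC —

At time $t$, for $\theta^{(t)} = (k, \theta_k)$

1. For $\ell \in \{k_1, \ldots, k_m\}$, generate $u_{k\ell} \sim \varphi_{k\ell}(u_{k\ell})$ so that

   $$(\theta_\ell, u_{\ell k}) = T_{k\ell}(\theta_k, u_{k\ell})$$

   and compute the intensity

   (11.10) $\lambda_{k\ell} = \dfrac{L_\ell(\theta_\ell)\,\pi(\ell, \theta_\ell)\,\varphi_{\ell k}(u_{\ell k})}{L_k(\theta_k)\,\pi(k, \theta_k)\,\varphi_{k\ell}(u_{k\ell})} \left| \dfrac{\partial T_{k\ell}(\theta_k, u_{k\ell})}{\partial(\theta_k, u_{k\ell})} \right|$

2. Compute the full intensity

   $$\lambda = \lambda_{kk_1} + \cdots + \lambda_{kk_m} .$$

3. Generate the jumping time as $t + V$, with $V \sim \mathcal{E}xp(\lambda)$.

4. Select the jump move as $k_i$ with probability $\lambda_{kk_i}/\lambda$.

5. Set time to $t = t + V$ and take $\theta^{(t)} = (k_i, \theta_{k_i})$.

[A.50]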



Moves that update model parameters (or hyperparameters) without changing
the dimension of the model may also be incorporated as additional jump processes, that
is, by introducing corresponding (fixed) intensities for these moves in the sum
of Step 2 above. Note the strong similarity of (11.10) with the reversible
jump acceptance probability (11.4). Recall that the intensity of an exponential
distribution is the inverse expectation: this means that (11.10) is inversely
proportional to the waiting time until the jump to $(\ell, \theta_\ell)$ and that this waiting
time will be small if either $L_\ell(\theta_\ell)$ is large or if $L_k(\theta_k)$ is small. The latter case
corresponds to a previous move to an unlikely value of $\theta$.

The continuous time structure of [A.50] is rather artificial. Rather than
simulating the jumping time as in Step 3, a regular Rao-Blackwellization argument
shows that the algorithm can just as well be implemented as a sequence of iterations, where each value
$(k, \theta_k)$ is weighted by the average waiting time $\lambda^{-1}$.
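
The equivalence between simulated holding times and weights $\lambda^{-1}$ can be seen on a toy example. The sketch below is not Algorithm [A.50] itself: it assumes a hypothetical finite state space $\{0, \ldots, K-1\}$ with unnormalized target weights $w_i$ and nearest-neighbour moves whose intensities are chosen as $\lambda_{ij} = w_j$, which is one simple way of satisfying the local balance condition (11.9), since $w_i \lambda_{ij} = w_i w_j = w_j \lambda_{ji}$. The function name `jump_process` is an illustrative choice.

```python
import random


def jump_process(weights, n_jumps, seed=0):
    """Toy Markov jump process on {0, ..., K-1} with target proportional to weights.

    Returns two estimates of the target: one based on simulated holding times,
    one where each visit is weighted by the expected holding time 1 / lambda.
    """
    rng = random.Random(seed)
    n_states = len(weights)
    state = 0
    time_in = [0.0] * n_states  # simulated holding times (Step 3 of a jump scheme)
    rb_in = [0.0] * n_states    # Rao-Blackwellized weights 1 / lambda
    for _ in range(n_jumps):
        neighbours = [j for j in (state - 1, state + 1) if 0 <= j < n_states]
        rates = [weights[j] for j in neighbours]  # lambda_ij = w_j (assumption)
        total = sum(rates)                        # full intensity in this state
        time_in[state] += rng.expovariate(total)
        rb_in[state] += 1.0 / total
        state = rng.choices(neighbours, weights=rates)[0]

    def normalize(values):
        s = sum(values)
        return [v / s for v in values]

    return normalize(time_in), normalize(rb_in)


by_time, by_weight = jump_process([1.0, 2.0, 3.0, 4.0], 200_000)
print("holding-time estimate:    ", [round(p, 3) for p in by_time])
print("Rao-Blackwellized estimate:", [round(p, 3) for p in by_weight])
```

Both estimates should approach (0.1, 0.2, 0.3, 0.4); the second one never simulates an exponential variable, which is the point of the Rao-Blackwellization remark.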

The following example illustrates this generalization of Stephens (2000) in 
the setting of mixtures. 



Example 11.10. (Continuation of Example 11.1.) For the following
representation of the normal mixture model,

$$Y_t \sim \sum_{i=1}^k \omega_i \, \mathcal{N}(\mu_i, \sigma_i^2) \Big/ \sum_{i=1}^k \omega_i ,$$

Cappé et al. (2003) have developed split-and-combine moves for continuous
time schemes, reproducing Richardson and Green (1997) within this context.
The split move for a given component $\theta_\ell$ of the $k$ component vector $\theta_k$ is to
split this component so as to give rise to a new parameter vector with $k + 1$
components, defined as

$$(\theta_1, \ldots, \theta_{\ell-1}, \theta_{\ell+1}, \ldots, \theta_k, T(\theta_\ell, u_{k(k+1)})) ,$$

where $T$ is a differentiable one-to-one mapping that produces two new
components and $u_{k(k+1)} \sim \varphi(u_{k(k+1)})$. Note that the representation of the
weights via the $\omega_i$'s is not identifiable but avoids the renormalizing step in
the jump from $\mathcal{M}_k$ to $\mathcal{M}_{k+1}$ or $\mathcal{M}_{k-1}$: the $\theta_i$'s are then a priori independent.
We assume that the mapping $T$ is symmetric in the sense that

(11.11) $P(T(\theta, u) \in B' \times B'') = P(T(\theta, u) \in B'' \times B')$

for all $B', B'' \subset \mathbb{R} \times \mathbb{R}_+^2$.

We denote the total splitting intensity by $\eta(\theta_k)$, which is the sum over $\ell =
1, \ldots, k$ of the intensities associated with a split of $\theta_\ell$ in $(\theta_\ell', \theta_\ell'')$, assuming
that, in a split move, each component $\ell$ is chosen with equal probability $1/k$.
In the basic scheme, all the split intensities can be chosen as constants, that
is, $\eta/k$.

Conversely, the combine intensity, that is, the intensity of the waiting time
until the combination of a pair of components of $\theta_k$ (there are $k(k-1)/2$ such
pairs), can be derived from the local balance equation (11.9). For the mixture
model and a move from $\theta_{k+1}$ to $\theta_k$, with a combination of two components of
$\theta_{k+1}$ into $\theta_\ell$, and an auxiliary variable $u_{k(k+1)}$, the combine intensity $\lambda_{(k+1)k}$
is thus given by

$$L_k(\theta_k)\,\pi(k, \theta_k)\, k! \, \frac{\eta(\theta_k)}{k} \, 2\varphi(u_{k(k+1)}) \left| \frac{\partial T(\theta_\ell, u_{k(k+1)})}{\partial(\theta_\ell, u_{k(k+1)})} \right|^{-1} = L_{k+1}(\theta_{k+1})\,\pi(k+1, \theta_{k+1})\,(k+1)! \, \lambda_{(k+1)k} .$$



As previously seen in Example 11.7, the factorials arise because of the ordering
of components and $\eta(\theta)/k$ is the rate of splitting a particular component
as $\eta(\theta)$ is the overall split rate. The factor 2 is a result of the symmetry
assumption (11.11): in a split move, a component $\theta$ can be split into the pair
$(\theta', \theta'')$ as well as into the pair $(\theta'', \theta')$ in reverse order, but these configurations
are equivalent. However, the two ways of getting there are typically associated
with different values of $u$ and possibly also with different densities $\varphi(u)$; the
symmetry assumption is precisely what ensures that the densities at these two
values of $u$ coincide and hence we may replace the sum of two densities, that
we would otherwise be required to compute, by the factor 2. We could proceed
without such symmetry but would then need to consider separately the densities of $u$
when combining the pairs $(\theta', \theta'')$ and $(\theta'', \theta')$.

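Thus, the intensity corresponding to the combination of two components is

(11.12) $\lambda_{(k+1)k} = \dfrac{L_k(\theta_k)\,\pi(k, \theta_k)}{L_{k+1}(\theta_{k+1})\,\pi(k+1, \theta_{k+1})} \, \dfrac{\eta(\theta_k)}{(k+1)k} \, 2\varphi(u_{k(k+1)}) \left| \dfrac{\partial T(\theta, u)}{\partial(\theta, u)} \right|^{-1} .$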

Cappé et al. (2003) also compared the discrete and continuous time
approaches to the mixture model of Example 11.10 and they concluded that the
differences between both algorithms are very minor, with the continuous time
approach generally requiring more computing time. ||



11.4 Problems 

11.1 In the situation of Example 11.3, we stress that the best fitting AR(p+1) model
is not necessarily a modification of the best fitting AR(p) model. Illustrate this
fact by implementing the following experiment.

(a) Set $p = 3$ and parameter values $\theta_i = i$, $i = 1, \ldots, 3$.

(b) Simulate two data sets, one from an AR(3) model and one from an AR(1)
model.

For each data set:

(c) Estimate the AR(3) model parameters $\theta_i$ and $\sigma_i$ for $i = 1, \ldots, 3$, using
independent priors $\theta_i \sim \mathcal{N}(0, \tau^2)$ and $\sigma_i^{-2} \sim \mathcal{G}a(1, 1)$.

(d) Estimate the AR(1) model parameters $\theta_1$ and $\sigma_1$. Then for $i = 2, \ldots, p$,
estimate $\theta_i$ conditional on the $\theta_{i-1}, \theta_{i-2}, \ldots$ under independent priors $\theta_{ip} \sim
\mathcal{N}(0, \tau^2)$, $\sigma_p^{-2} \sim \mathcal{G}a(1, 1)$ for both simulated datasets.

(e) Compare the results of (c) and (d).

11.2 In the situation of Example 7.13, if a second explanatory variable, $z$, is
potentially influential in

$$\mathbb{E}[Y|x, z] = \frac{\exp(a + bx + cz)}{1 + \exp(a + bx + cz)} ,$$

the two competing models are $\Theta_1 = \{1\} \times \mathbb{R}^2$ and $\Theta_2 = \{2\} \times \mathbb{R}^3$, with $\theta_1 = (a, b)$
and $\theta_2 = (a, b, c)$. Assume the jump from $\Theta_2$ to $\Theta_1$ is a (deterministic) transform
of $(2, a, b, c)$ to $(1, a, b)$. Show that the jump from $\Theta_1$ to $\Theta_2$ in a reversible jump
algorithm must necessarily preserve the values of $a$ and $b$ and thus only modify
$c$. Write out the details of the reversible jump algorithm.

11.3 For the situation of Example 11.5, show that the four possible moves are made 
with the following probabilities: 

(i) (ao, ai) (a'o, ai), with probability tth min (^1, > 

(ii) (qo, ai) ^ (ao, ai, ai), with probability 7 Ti 2 min ^1, j 

(iii) (ao,ai,a2)-^ (ai, a), ai) with probability 7721 min (^1, 

(iv) (ao,ai,a2) (ao,ai) with probability 7722 min (^1, /o(ao)A(a'i )/2\c«2) ) ' 

11.4 Similar to the situation of Example 11.5, the data in Table 7.6, given in Prob- 
lem 7.21, are a candidate for model averaging between a linear and quadratic 
regression. This “braking data” (Tukey 1977) is the distance needed to stop (y) 
when an automobile travels at a particular speed (x) and brakes. 

(a) Fit the data using a reversible jump algorithm that jumps between a linear 
and quadratic model. Use normal priors with relatively large variances. 

(b) Assess the robustness of the fit to specifications of the $\pi_{ij}$. More precisely,
choose values of the $\pi_{ij}$ so that the algorithm spends 25%, 50%, and 75% of
the time in the quadratic space. How much does the overall model average
change?
11.5 Consider a sequence of models $\mathcal{M}_k$, $k = 1, \ldots, K$, with corresponding
parameters $\theta_k \in \Theta_k$. For a given reversible jump algorithm, the auxiliary variables
$u_{mn}$ are associated with completion transforms $T_{mn}$ such that

$$(\theta_n, u_{nm}) = T_{mn}(\theta_m, u_{mn}) .$$

(a) Show that the saturated model made of the distribution of the $\theta_k$'s
completed with all the $u_{mn}$'s,

$$\pi(k, \theta_k) \prod_{m,n} \varphi_{mn}(u_{mn}) ,$$

is a fixed dimension model with the correct marginal on $(k, \theta_k)$.

(b) Show that the reversible jump move is not a Metropolis-Hastings proposal
compatible with this joint distribution.

(c) Explain why the $u_{mn}$'s do not need to be taken into account in moves that
do not involve both models $\mathcal{M}_n$ and $\mathcal{M}_m$. (Hint: Include an additional pair
$(u_{mn}, u_{nm})$ and check that they vanish from the probability ratio.)

11.6 Consider the model introduced in Example 11.6. 

(a) Show that the weight of the jump from $\Theta_{k+1}$ to $\Theta_k$ involves the Jacobian



d{p^^\a^^\ui,U2) 



(b) Compute the Jacobian for the jump from $\Theta_4$ to $\Theta_3$. (The determinant can
be calculated by row-reducing the matrix to triangular form, or using the
fact that

$$\begin{vmatrix} A & B \\ C & D \end{vmatrix} = |A| \times |D - CA^{-1}B| .)$$



11.7 For the model of Example 11.6: 

(a) Derive general expressions for the weight of the jump from $\Theta_k$ to $\Theta_{k+1}$ and
from $\Theta_{k+1}$ to $\Theta_k$.

(b) Select a value of the Poisson parameter $\lambda$ and implement a reversible jump
algorithm to fit a density estimate to the data of Table 11.1. (Note: Previous
investigation and subject matter considerations indicate that there
may be between three and seven modes. Use this information to choose an
appropriate value of $\lambda$.)

(c) Investigate the robustness of your density estimate to variations in $\lambda$.

11.8 (Chib 1995) Consider a posterior distribution $\pi(\theta_1, \theta_2, \theta_3|x)$ such that the
three full conditional distributions $\pi(\theta_1|\theta_2, \theta_3, x)$, $\pi(\theta_2|\theta_1, \theta_3, x)$, and $\pi(\theta_3|\theta_1, \theta_2, x)$
are available.

(a) Show that

$$\log m(x) = \log f(x|\hat\theta) + \log \pi(\hat\theta) - \log \pi(\hat\theta_3|\hat\theta_1, \hat\theta_2, x) - \log \pi(\hat\theta_2|\hat\theta_1, x) - \log \pi(\hat\theta_1|x) ,$$

where $(\hat\theta_1, \hat\theta_2, \hat\theta_3)$ are arbitrary values of the parameters that may depend
on the data.
9172 9350 9483 9558 9775 10227
10406 16084 16170 18419 18552 18600 
18927 19052 19070 19330 19343 19349 
19440 19473 19529 19541 19547 19663 
19846 19856 19863 19914 19918 19973 
19989 20166 20175 20179 20196 20215 
20221 20415 20629 20795 20821 20846 
20875 20986 21137 21492 21701 21814 
21921 21960 22185 22209 22242 22249 
22314 22374 22495 22746 22747 22888 
22914 23206 23241 23263 23484 23538 
23542 23666 23706 23711 24129 24285 
24289 24366 24717 24990 25633 26960 
26995 32065 32789 34279 

Table 11.1. Velocity (km/second) of galaxies in the Corona Borealis Region.
(Source: Roeder 1992.)



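(b) Show that $\pi(\hat\theta_1|x)$ can be approximated by

$$\hat\pi(\hat\theta_1|x) = \frac{1}{T} \sum_{t=1}^T \pi(\hat\theta_1|\theta_2^{(t)}, \theta_3^{(t)}, x) ,$$

where the $(\theta_1^{(t)}, \theta_2^{(t)}, \theta_3^{(t)})$ are generated by Gibbs sampling.

(c) Show that $\pi(\hat\theta_2|\hat\theta_1, x)$ can be approximated by

$$\hat\pi(\hat\theta_2|\hat\theta_1, x) = \frac{1}{T} \sum_{t=1}^T \pi(\hat\theta_2|\hat\theta_1, \theta_3^{(t)}, x) ,$$

where the $(\theta_2^{(t)}, \theta_3^{(t)})$'s are generated by Gibbs sampling from the conditional
distributions $\pi(\theta_2|\hat\theta_1, \theta_3^{(t-1)}, x)$ and $\pi(\theta_3|\hat\theta_1, \theta_2^{(t)}, x)$, that is, with $\theta_1$ being
kept equal to $\hat\theta_1$.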

(d) Apply this marginalization method to the setup of Example 11.1 and the 
data in Table 11 . 1 . 

(e) Evaluate the computing cost of this method. 

{Note: See Chib and Jeliazkov 2001 for an extension to larger numbers of com- 
ponents.) 
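
As an illustration of the identity in part (a), here is a minimal sketch on a deliberately simpler case than the problem: two blocks instead of three, for normal data with unknown mean $\mu$ and precision $\tau$. The priors $\mu \sim \mathcal{N}(m_0, 1/p_0)$ and $\tau \sim \mathcal{G}a(a_0, b_0)$, the data values, and all function names are assumptions made for the example. The ordinate $\pi(\hat\mu|x)$ is estimated by averaging the full conditional of $\mu$ over Gibbs draws of $\tau$, while $\pi(\hat\tau|\hat\mu, x)$ is available in closed form.

```python
import math
import random


def log_norm_pdf(x, mean, prec):
    """Log density of N(mean, 1/prec) at x."""
    return 0.5 * math.log(prec / (2 * math.pi)) - 0.5 * prec * (x - mean) ** 2


def log_gamma_pdf(x, shape, rate):
    """Log density of Ga(shape, rate) at x."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1) * math.log(x) - rate * x)


def chib_log_marginal(x, mu_star, tau_star, n_draws=5000, burn=500, seed=0,
                      m0=0.0, p0=1.0, a0=2.0, b0=2.0):
    """Chib-type estimate of log m(x) evaluated at the point (mu_star, tau_star)."""
    rng = random.Random(seed)
    n, sx = len(x), sum(x)

    def mu_cond(tau):   # mean and precision of mu | tau, x
        prec = p0 + n * tau
        return (p0 * m0 + tau * sx) / prec, prec

    def tau_cond(mu):   # shape and rate of tau | mu, x
        return a0 + n / 2, b0 + 0.5 * sum((xi - mu) ** 2 for xi in x)

    mu, tau, ordinates = sx / n, 1.0, []
    for t in range(burn + n_draws):
        mean, prec = mu_cond(tau)
        mu = rng.gauss(mean, 1 / math.sqrt(prec))
        shape, rate = tau_cond(mu)
        tau = rng.gammavariate(shape, 1 / rate)  # second argument is a scale
        if t >= burn:
            mean, prec = mu_cond(tau)
            ordinates.append(math.exp(log_norm_pdf(mu_star, mean, prec)))
    log_post_mu = math.log(sum(ordinates) / len(ordinates))   # pi(mu*|x), estimated
    shape, rate = tau_cond(mu_star)
    log_post_tau = log_gamma_pdf(tau_star, shape, rate)       # pi(tau*|mu*, x), exact
    log_lik = sum(log_norm_pdf(xi, mu_star, tau_star) for xi in x)
    log_prior = log_norm_pdf(mu_star, m0, p0) + log_gamma_pdf(tau_star, a0, b0)
    return log_lik + log_prior - log_post_tau - log_post_mu


data = [0.3, -0.5, 1.2, 0.8, 0.1, -0.2, 0.9, 0.4]  # made-up observations
print(chib_log_marginal(data, mu_star=0.3, tau_star=1.5))
print(chib_log_marginal(data, mu_star=0.5, tau_star=2.0, seed=1))
```

Since the identity holds for any evaluation point, the two calls should return nearly the same value, which gives a cheap sanity check; in practice a high posterior density point is preferred because the ordinate estimate is then more stable.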

11.9 A saturation scheme due to Carlin and Chib (1995) is to consider all models
at once. Given $\mathcal{M}_k$ ($k = 1, \ldots, K$) with corresponding priors $\pi_k(\theta_k)$, and prior
weights $p_k$, the parameter space is

$$\Theta = \{1, \ldots, K\} \times \prod_{k=1}^K \Theta_k .$$

(a) If $\mu$ denotes the model indicator, show that the posterior distribution is

$$\pi(\mu, \theta_1, \ldots, \theta_K|x) \propto p_\mu f_\mu(x|\theta_\mu) \prod_{k=1}^K \pi_k(\theta_k) .$$

(b) Show that

$$m(x|\mu = j) = \int f_j(x|\theta_j)\,\pi(\theta_1, \ldots, \theta_K|\mu = j)\, d\theta = \int f_j(x|\theta_j)\,\pi_j(\theta_j)\, d\theta_j$$

does not depend on the $\pi_k(\theta_k)$'s for $k \neq j$.

(c) Deduce that, when in model $\mathcal{M}_j$, that is, when $\mu = j$, the parameters $\theta_k$ for
$k \neq j$ can be simulated using arbitrary distributions $\pi_k(\theta_k|\mu = j)$. (Note:
Those are called pseudo-priors by Carlin and Chib 1995.)

(d) Construct a corresponding Gibbs sampler on $(\mu, (\theta_1, \ldots, \theta_K))$, where $\mu$ is
generated from

$$P(\mu = j|x, \theta_1, \ldots, \theta_K) \propto p_j f_j(x|\theta_j)\,\pi_j(\theta_j) \prod_{k \neq j} \pi_k(\theta_k|\mu = j) .$$

(e) Comment on the costliness of the method when $K$ is large, namely, on the
drawback that all models $\mathcal{M}_k$ must be considered at every stage.

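11.10 (Continuation of Problem 11.9) On a dataset of $n$ pine trees, the grain
(strength) $y_i$ is regressed against either the wood density $x_i$,

$$\mathcal{M}_1 : \quad y_i = \alpha + \beta x_i + \sigma \varepsilon_i ,$$

or a modified (resin adapted) density, $z_i$,

$$\mathcal{M}_2 : \quad y_i = \gamma + \delta z_i + \tau \varepsilon_i .$$

Both model parameters are associated with conjugate priors,

$$\begin{pmatrix} \alpha \\ \beta \end{pmatrix}, \begin{pmatrix} \gamma \\ \delta \end{pmatrix} \sim \mathcal{N}_2\left( \begin{pmatrix} 3000 \\ 185 \end{pmatrix}, \begin{pmatrix} 10^6 & 0 \\ 0 & 10^4 \end{pmatrix} \right) , \qquad \sigma^2, \tau^2 \sim \mathcal{IG}(a, b) ,$$

with $(a, b) = (1, 1)$. For the pseudo-priors

$$\alpha|\mu = 2 \sim \mathcal{N}(3000, 52^2) , \qquad \beta|\mu = 2 \sim \mathcal{N}(185, 12^2) ,$$
$$\gamma|\mu = 1 \sim \mathcal{N}(3000, 43^2) , \qquad \delta|\mu = 1 \sim \mathcal{N}(185, 9^2) ,$$

and $\sigma^2, \tau^2 \sim \mathcal{IG}(a, b)$, describe the saturated algorithm of Carlin and Chib
(1995). (Note: For their dataset, the authors were forced to use disproportionate
weights, $p_1 = .9995$ and $p_2 = .0005$, in order to force visits to the model $\mathcal{M}_1$.)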

11.11 In the setting of Example 11.7, for the birth and death moves,

(a) If $(p_1, \ldots, p_k) \sim \mathcal{D}(\alpha_1, \ldots, \alpha_k)$, show that $p_k \sim \mathcal{B}e(\alpha_k, \alpha_1 + \ldots + \alpha_{k-1})$.

(b) Deduce that, if 



MGk) = X PTk '■■■Ptk 

the reversible jump acceptance ratio is indeed (11.5) when is 

simulated from Be{ak,ai -f-... + afc_i) and {p(^k+i){k-i-i),(^(k+i)(k+i)) from 

7T^. 

11.12 Show that a mixture model $\mathcal{M}_k$ as in Example 11.7 is invariant under
permutation of its components and deduce that the parameter $\theta_k$ is identifiable up
to a permutation of its components.
11.13 In the setup of the AR(p) model of Example 11.3, it is preferable to impose
stationarity (or causality) on the model (11.7) through the constraint $|\lambda_i| < 1$
on the roots $\lambda_i$, as described in Example 11.8. The case when some roots are
at the boundary of this constrained space, $|\lambda_i| = 1$, is sometimes singled out in
economics and finance, as an indicator of a special behavior of the time series.
Such roots are called unit roots.

(a) By considering the particular case $p = 1$ and $\lambda_1 = 1$, show that the model
(11.7) is not stationary at the boundary.

(b) Show that the prior distribution defined in Example 11.8 does not put
weight on the boundary.

(c) If we denote by $u_p^r$ and $u_p^c$ the number of real and complex unit roots in
$\mathcal{M}_p$, respectively, write the uniform prior on the $\lambda_i$'s when there are $u_p^r$ real
and $u_p^c$ complex unit roots.

(d) Derive a reversible jump algorithm in the spirit of Example 11.8, by
introducing moves within $\mathcal{M}_p$ that change either $u_p^r$ or $u_p^c$.

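11.14 A switching AR model is an autoregressive model where the mean can switch
between two values as follows.

$$X_t|x_{t-1}, z_t, z_{t-1} \sim \mathcal{N}(\mu_{z_t} + \varphi(x_{t-1} - \mu_{z_{t-1}}), \sigma^2) ,$$
$$Z_t|z_{t-1} \sim p_{z_{t-1}} \mathbb{I}_{z_{t-1}}(z_t) + (1 - p_{z_{t-1}}) \mathbb{I}_{1-z_{t-1}}(z_t) ,$$

where $Z_t \in \{0, 1\}$ is not observed, with $z_0 = 0$ and $x_0 = \mu_0$.

(a) Construct a data augmentation algorithm by showing that the conditional
distributions of the $z_t$'s ($1 < t < T$) are

$$P(z_t|z_{t-1}, z_{t+1}, x_t, x_{t-1}, x_{t+1}) \propto \exp\left\{ -\frac{1}{2\sigma^2} \left[ (x_t - \mu_{z_t} - \varphi(x_{t-1} - \mu_{z_{t-1}}))^2 + (x_{t+1} - \mu_{z_{t+1}} - \varphi(x_t - \mu_{z_t}))^2 \right] \right\}$$
$$\times \left( p_{z_{t-1}} \mathbb{I}_{z_{t-1}}(z_t) + (1 - p_{z_{t-1}}) \mathbb{I}_{1-z_{t-1}}(z_t) \right) \times \left( p_{z_t} \mathbb{I}_{z_t}(z_{t+1}) + (1 - p_{z_t}) \mathbb{I}_{1-z_t}(z_{t+1}) \right) ,$$

with appropriate modifications for the limiting cases $z_1$ and $z_T$.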
(b) Under the prior 

(mi - Mo) ~ A/"(0,C^), = 1/ff, M0,M1 ~ 

show that the parameters can be generated as 

Mo ~ A/'([noo(l - <M)(yoo “ Woo) + noi<M(Mi - Voi + Woi) 
+nio(<MMi + ^10 - Wio) + [^ioo(l - ¥>)^ + noMP^ 

+mo + , <T^[noo(l - <m)^ + noiip^ + nio + , 



Ml ~ ^([1111(1 - i^)( 2 /ii - <MMii) + nio<M(Mo - Mio + Wio) 
+noi (ipMo + Moi - Woi) +C~^o'^Mo] [nii(l - <p)^ + mo<M^ 
+T101 + ,a^[mi(l - ifiY +«io<^^ +noi +C^o’^]“‘) , 
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V Et=i(2/t-i Et=i(2/‘-i 

<T^ ~T0 I i - ^p{yt-i - 

V ^ t=i 

po ~ Be{noo + 1, noi + 1), 
pi ~ 5e(nii + 1, mo + 1), 

where nij denotes the number of jumps from i to j and 



iVij = ^ yt^ 



Tiij yij 



= E 



yt-i- 



Zt=j 



Zt-l=t 

zt=0 



(c) Generalize this algorithm to a birth-and-death reversible jump algorithm
when the number of means $\mu_i$ is unknown.

11.15 Consider a sequence $X_1, \ldots, X_n$ such that

$$X_i \sim \begin{cases} \mathcal{N}(\mu_1, \tau^2) & \text{if } i \le i_0 \\ \mathcal{N}(\mu_2, \tau^2) & \text{if } i > i_0 . \end{cases}$$

(a) When $i_0$ is uniformly distributed on $\{1, \ldots, n-1\}$, determine whether the
improper prior $\pi(\mu_1, \mu_2, \tau) = 1/\tau$ leads to a proper posterior.

(b) Let $\mathcal{M}_k$ denote the model with change-point at $i = k$. Derive a Monte Carlo
algorithm for this setup.

(c) Use an importance sampling argument to recycle the sample produced by
the algorithm in part (b) when the prior is

$$\pi(\mu_1, \mu_2, \tau) = \exp\left\{ -(\mu_1 - \mu_2)^2/\tau^2 \right\} / \tau^2 .$$

11.16 Here we look at an alternative formulation for the situation of Problem 11.15. 

(a) Show that an alternative implementation is to use (n — 1) latent variables 
Zi which are the indicators of change-points. 

(b) Extend the analysis to the case where the number of change-points is un- 
known. 

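11.17 The analysis of counting data is sometimes associated with a change-point
representation. Consider a Poisson process

$$Y_i \sim \mathcal{P}(\theta t_i) \quad (i \le \tau) , \qquad Y_i \sim \mathcal{P}(\lambda t_i) \quad (i > \tau) ,$$

with

$$\theta \sim \mathcal{G}a(\alpha_1, \beta_1) , \qquad \lambda \sim \mathcal{G}a(\alpha_2, \beta_2) ,$$

and

$$\beta_1 \sim \mathcal{G}a(\delta_1, \varepsilon_1) , \qquad \beta_2 \sim \mathcal{G}a(\delta_2, \varepsilon_2) ,$$

where $\alpha_i$, $\varepsilon_i$, and $\delta_i$ are assumed to be known ($i = 1, 2$).

(a) Check that the posterior distribution is well defined.

(b) Show that the following algorithm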
Decade \ Year   0  1  2  3  4  5  6  7  8  9
1850            -  4  5  4  1  0  4  3  4  0
1860            6  3  3  4  0  2  6  3  3  5
1870            4  5  3  1  4  4  1  5  5  3
1880            4  2  5  2  2  3  4  2  1  3
1890            2  2  1  1  1  1  3  0  0  1
1900            0  1  1  0  0  3  1  0  3  2
1910            2  0  1  1  1  0  1  0  1  0
1920            0  0  2  1  0  0  0  1  1  0
1930            2  3  3  1  1  2  1  1  1  1
1940            2  4  2  0  0  0  1  4  0  0
1950            0  1  0  0  0  0  0  1  0  0
1960            1  0  1  -  -  -  -  -  -  -

Table 11.2. Yearly number of mining accidents in England for the years 1851 to
1962, from Maguire et al. (1952) and Jarrett (1979) datasets. (Source: Carlin et al.
1992.)



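Algorithm A.51 —Changepoint Poisson Model—

1. Generate ($1 \le k \le n$)

   $$\tau \sim P(\tau = k) \propto \exp\left\{ (\lambda - \theta) \sum_{i=1}^k t_i \right\} \left( \frac{\theta}{\lambda} \right)^{\sum_{i=1}^k y_i} .$$

2. Generate

   $$\theta \sim \mathcal{G}a\left( \alpha_1 + \sum_{i=1}^{\tau} y_i , \; \beta_1 + \sum_{i=1}^{\tau} t_i \right) , \qquad \lambda \sim \mathcal{G}a\left( \alpha_2 + \sum_{i=\tau+1}^{n} y_i , \; \beta_2 + \sum_{i=\tau+1}^{n} t_i \right) .$$

3. Generate

   $$\beta_1 \sim \mathcal{G}a(\delta_1 + \alpha_1, \theta + \varepsilon_1) , \qquad \beta_2 \sim \mathcal{G}a(\delta_2 + \alpha_2, \lambda + \varepsilon_2) .$$

[A.51]

is a valid Gibbs sampler.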

(c) Show that the model can also be represented by associating with each obser- 
vation yi a latent variable Zi G {0, 1} which is the indicator of the change- 
point. Deduce that this representation does not modify the above algorithm. 

(d) For the data in Table 11.2, which summarizes the number of mining
accidents in England from 1851 to 1962, and for the hyperparameters
$\alpha_1 = \alpha_2 = 0.5$, $\delta_1 = \delta_2 = 0$, and $\varepsilon_1 = \varepsilon_2 = 1$, apply the above
algorithm to estimate the change-point. (Note: Raftery and Akman 1986
obtain estimates of $k$ located between 1889 and 1892.)
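
The three Gibbs steps of this change-point sampler can be sketched as follows. This is only an illustration under stated assumptions: all exposures are taken as $t_i = 1$, the prior on $\tau$ is uniform, the data are a made-up sequence with an obvious change after the 40th observation (not the mining data of Table 11.2), and the function name `changepoint_gibbs` is hypothetical. The default hyperparameters are those of part (d).

```python
import math
import random


def changepoint_gibbs(y, n_iter=2000, burn=200, seed=0,
                      alpha=(0.5, 0.5), delta=(0.0, 0.0), eps=(1.0, 1.0)):
    """Gibbs sampler for a Poisson change-point model with t_i = 1 (assumption).

    Returns the post burn-in draws of the change-point tau in {1, ..., n}.
    """
    rng = random.Random(seed)
    n = len(y)
    cum = [0]
    for v in y:
        cum.append(cum[-1] + v)          # cum[k] = y_1 + ... + y_k
    tau = n // 2                          # crude starting values
    theta = max(cum[tau] / tau, 0.1)
    lam = max((cum[n] - cum[tau]) / (n - tau), 0.1)
    beta1 = beta2 = 1.0
    draws = []
    for it in range(n_iter):
        # Step 1: tau from its discrete full conditional, computed on the log scale.
        logw = [(lam - theta) * k + cum[k] * math.log(theta / lam)
                for k in range(1, n + 1)]
        top = max(logw)
        tau = rng.choices(range(1, n + 1),
                          weights=[math.exp(v - top) for v in logw])[0]
        # Step 2: Poisson rates before and after the change-point
        # (random.gammavariate takes a shape and a scale, hence 1 / rate).
        theta = rng.gammavariate(alpha[0] + cum[tau], 1.0 / (beta1 + tau))
        lam = rng.gammavariate(alpha[1] + cum[n] - cum[tau],
                               1.0 / (beta2 + n - tau))
        # Step 3: hyperparameters beta_1 and beta_2.
        beta1 = rng.gammavariate(delta[0] + alpha[0], 1.0 / (theta + eps[0]))
        beta2 = rng.gammavariate(delta[1] + alpha[1], 1.0 / (lam + eps[1]))
        if it >= burn:
            draws.append(tau)
    return draws


# Synthetic counts: 40 observations around 10, then 60 observations around 1.
y = [9, 11, 10, 8, 12] * 8 + [1, 0, 2, 1, 1] * 12
draws = changepoint_gibbs(y)
mode = max(set(draws), key=draws.count)
print("posterior mode of tau:", mode)
```

Replacing `y` with the yearly counts of Table 11.2 (read row by row, dropping the missing entries) gives the analysis requested in part (d); the weights in Step 1 are computed on the log scale to avoid overflow when the two rates differ strongly.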

11.18 Barry and Hartigan (1992, 1993) study another change-point model where
the number $\nu$ of change-points is random. They introduce the notion of a
partition product; assuming that the probability that $\nu$ is equal to $k$ and
that the change-points are in $1 < i_1 < \ldots < i_k < n$, this is given by

$p(k, i_1, \ldots, i_k) \propto c_{1 i_1} c_{i_1 i_2} \cdots c_{i_k n}$, where



Cij 



(j — i)~^ for 1 < 2 < j < n 

^ U ~ for 2 = 1 or j = n 

n~^ for 2 = 1 and j = n. 



The distributions on the observables Xi and their mean are normal (ij-i < 
2 < ij) 

Xi ~ , a") , 6, ~ AA f/xo, . 

V ^3 - 1 

while the prior distributions on the parameters cr, ao, / 20 , and p are 



7r(cr^) = 1/cr^ 



^2 

2 ”— - 2 1^ ^ ^[ 0 , 1 ^ 0 ]’ ^(/^o) 1 and p ~ ^[o,po] 

(7n -h (J 



with po < 1 and ico < 1- 

(a) Show that the posterior distribution of 2 /, 21 , . . . , ii., , . . . , is 

7t(za,2i, . . . ,^n|cT,O'0,P,/20,Xl, . . . , Xn) OC . . . d^n 

-l)tei-Mo)^l / (n - - tJ’of \ 






X exp ^ — - 

X n n ‘ 

jf = l 

X (ii — 1)^^^ ...{ij — ...{n- 



2<7q 



2(t2 



(b) Show that integrating out leads to 

7t(z/,2i, . . . ,iu\(T, (70,P, /20,Xi, . . . , Xn) OC C\i^ . . . 



i^+l 



ii-1 



X n exp ^ 

j = l I 



{xi - Xi-_^)^ {ij - ij-i){xij_j^ - pioY 



2^2 



2{a^ + al) 



(c) Deduce that the posterior distribution is well-defined.

(d) Construct a reversible jump algorithm for the simulation of $(\nu, i_1, \ldots, i_\nu)$.

11.19 (a) We now extend the model of Problem 11.17 to the case when the number
of change-points, $k$, is unknown. We thus have change-points $1 < \tau_1 < \cdots <
\tau_k < n$ such that, for $\tau_i \le j \le \tau_{i+1} - 1$, $Y_j \sim \mathcal{P}(\theta_{i+1} t_j)$. Using the priors

$$\theta_i \sim \mathcal{G}a(\alpha, \beta_i) , \qquad \beta_i \sim \mathcal{G}a(\delta, \varepsilon) ,$$

a uniform prior on $(\tau_1, \ldots, \tau_k)$, and $k \sim \mathcal{P}(\lambda)$, derive a reversible jump
algorithm for the estimation of $k$.

(b) Show that the latent variable structure described in Problem 11.17(c) can be extended to
this setup and that it avoids the variable dimension representation. Derive
a Gibbs sampler corresponding to this latent variable structure.

11.20 Show that, if $X_1, \ldots, X_n$ are iid $\mathcal{E}xp(\lambda)$, the random variable $\min(X_1, \ldots, X_n)$
is distributed from an $\mathcal{E}xp(n\lambda)$ distribution. Generalize to the case when the
$X_i$'s are $\mathcal{E}xp(\lambda_i)$.
11.21 A typical setup of convolution consists in the observation of

$$Y_t = \sum_{j=0}^{k} h_j X_{t-j} + \varepsilon_t, \qquad t = 1, \ldots, n,$$

where $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$, the $X_t$’s are not observed but have finite support,
$\mathcal{X} = \{s_1, \ldots, s_m\}$, and where the parameters of interest $h_j$ are unknown. The
discrete variables $X_t$ are iid with distribution $\mathcal{M}_m(\theta_1, \ldots, \theta_m)$ on $\mathcal{X}$, the
probabilities $\theta_i$ are distributed from a Dirichlet prior distribution,
$\mathcal{D}_m(\gamma_1, \ldots, \gamma_m)$, and the coefficients $h_j$ are from a conjugate distribution
$h = (h_0, \ldots, h_k) \sim \mathcal{N}_{k+1}(\mu, \Sigma)$.

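(a) Show that the completed likelihood is

$$L(x, h, \sigma) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{t=1}^{n} \left( y_t - \sum_{j=0}^{k} h_j x_{t-j} \right)^2 \right\}$$

and examine whether or not the EM algorithm can be implemented in this
setting.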

(b) Verify that the completion of Yt in (Yt,Xt) can be done as follows: 

Algorithm A. 52 —Deconvolution Completion- 

1 . Generate {t = 1 - k, . . . ,n) 



Xt ^ P{xt = 5f) oc 



^ I i=o 

. / . \ 

H" I Vi+j “ hiXt+j-t 

j=Q 



iz;:0 



and derive 



[A.52] 



A = 



f xi xq ... :ci-fc 

^ Xn —1 • - - j 



h = (X*X)-^X‘y , 



Where y = [yi , . . . ,2/n). 

(c) Verify that the update of the parameters (h,a^) is given by 

2 . Generate 



h' 



' Afk+I V + a~‘^X*Xh), (r"^ + a~‘^X^X)~^ 



(7^ 



I ( 2 /^ “ I ] 

j=Q 



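while the step for hyperparameters $\theta_i$ ($i = 1, \ldots, m$) corresponds to

3. Generate

$$(\theta_1, \ldots, \theta_m) \sim \mathcal{D}_m\left( \gamma_1 + \sum_{t=1-k}^{n} \mathbb{I}_{x_t = s_1}, \; \ldots, \; \gamma_m + \sum_{t=1-k}^{n} \mathbb{I}_{x_t = s_m} \right).$$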
(d) When $k$ is unknown, with a prior $k \sim \mathcal{P}(\lambda)$, develop a reversible jump
algorithm for estimating $k$, using the above completion and a simple
birth-death proposal based on moves from $\mathcal{M}_k$ to $\mathcal{M}_{k-1}$ and to $\mathcal{M}_{k+1}$.
(Note: See Kong et al. 1994, Gassiat 1995, Liu and Chen 1995, and Gamboa
and Gassiat 1997 for more detailed introductions to deconvolution models.)
11.22 (Neal 1999) Consider a two-layer neural network (or perceptron), introduced
in Note 5.5.2, where

$$\begin{cases} h_k = \tanh\left( \alpha_{k0} + \sum_j \alpha_{kj} x_j \right), & k = 1, \ldots, p, \\ \mathbb{E}[Y_\ell \mid h] = \beta_{\ell 0} + \sum_k \beta_{\ell k} h_k. & \end{cases}$$

Propose a reversible jump Monte Carlo method when the number of hidden
units $p$ is unknown, based on observations $(x_j, y_j)$ ($j = 1, \ldots, n$).

11.5 Notes 

11.5.1 Occam’s Razor

Associated with the issues of model choice and variable selection, the so-called
Occam’s Razor rule is often used as a philosophical argument in favor of
parsimony, that is, simpler models. William d’Occam or d’Ockham (ca. 1290-ca.
1349) was an English theologian from Oxford who worked on the bases of empirical
induction and, in particular, posed the principle later called Occam’s Razor,
which excludes a plurality of reasons for a phenomenon if they are not supported
by some experimentation (see Adams 1987). This principle,

Pluralitas non est ponenda sine necessitate,

(meaning Entities are not to be multiplied without necessity), is sometimes
invoked as a parsimony principle to choose the simplest between two equally
possible explanations, and its use is recurrent in the (recent) Bayesian
literature. However, it does not provide a working principle per se. (At a more
anecdotal level, Umberto Eco’s The Name of the Rose borrows from Occam to create
the character William of Baskerville.) See also Wolpert (1992, 1993) for an
opposite perspective on Occam’s Razor and Bayes factors.
Diagnosing Convergence



“Why does he insist that we must have a diagnosis? Some things are not 
meant to be known by man.” 

— Susanna Gregory, An Unholy Alliance 



12.1 Stopping the Chain 

In previous chapters, we have presented the theoretical foundations of MCMC 
algorithms and showed that, under fairly general conditions, the chains pro- 
duced by these algorithms are ergodic, or even geometrically ergodic. While 
such developments are obviously necessary, they are nonetheless insufficient 
from the point of view of the implementation of MCMC methods. They do not 
directly result in methods of controlling the chain produced by an algorithm 
(in the sense of a stopping rule to guarantee that the number of iterations 
is sufficient). In other words, while necessary as mathematical proofs of the 
validity of the MCMC algorithms, general convergence results do not tell us 
when to stop these algorithms and produce our estimates. For instance, the 
mixture model of Example 10.18 is fairly well behaved from a theoretical 
point of view, but Figure 10.3 indicates that the number of iterations used is 
definitely insufficient. 

Perhaps the only sure way of guaranteeing convergence is through the
types of calculations discussed in Example 8.8 and Section 12.2.1, where a
bound on the total variation distance between the $n$th-order transition kernel
$K^n(x, \cdot)$ and the stationary distribution $f$ is given. Then, for a specified
total variation distance, the needed value of $n$ can be solved for.
Unfortunately, such calculations are usually quite difficult, and are only
feasible in relatively simple settings. Thus, for generally applicable
convergence assessment strategies, we are left with empirical methods.
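
To make "solving for $n$" concrete in the simplest case, the following sketch
uses the classical bound for an independent Metropolis-Hastings algorithm whose
target satisfies $f \le Mg$, namely $2(1 - 1/M)^n$ (the bound recalled in Section
12.2.1). The function names and the tolerance `eps` are illustrative choices of
ours; for most samplers no such explicit bound is available.

```python
import math


def tv_bound(n, M):
    """Upper bound 2 (1 - 1/M)^n on the total variation distance after n steps."""
    return 2.0 * (1.0 - 1.0 / M) ** n


def iterations_needed(M, eps):
    """Smallest n such that 2 (1 - 1/M)^n <= eps (requires M >= 1 and eps > 0)."""
    if M < 1.0 or eps <= 0.0:
        raise ValueError("need M >= 1 and eps > 0")
    if eps >= 2.0:
        return 0          # the bound is trivially satisfied before any step
    if M == 1.0:
        return 1          # proposal equals target: the bound is 0 after one step
    n = math.ceil(math.log(eps / 2.0) / math.log(1.0 - 1.0 / M))
    while tv_bound(n, M) > eps:   # guard against floating-point rounding
        n += 1
    return n


if __name__ == "__main__":
    for M in (2.0, 10.0, 100.0):
        print(M, iterations_needed(M, 0.01))
```

The required number of iterations grows roughly linearly in $M$, which is one way
of seeing why a poorly matched proposal $g$ makes the guarantee expensive.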

Fig. 12.1. Contour plot of the log-posterior distribution for a probit sample of
1,000 observations, along with 1,000 points of an MCMC sample.

Example 12.1. Probit model revisited. The probit model defined in
Example 10.21 is associated with the posterior distribution

$$\pi(\beta, \sigma^2) \prod_{i=1}^{n} \Phi(r_i \beta / \sigma)^{d_i} \, \Phi(-r_i \beta / \sigma)^{1 - d_i},$$

where $\pi(\beta, \sigma^2)$ is the prior distribution and the pairs $(r_i, d_i)$ the observations.
In the special case where 

$$\pi(\beta, \sigma^2) = \exp\{-1/\sigma^2\} \exp\{-\beta^2/50\},$$

a contour plot of the log-posterior distribution is given in Figure 12.1, along
with the last 1,000 points of an MCMC sample after 100,000 iterations.
This MCMC sample is produced via a simple Gibbs sampler on the posterior
distribution where $\beta$ and $\sigma^2$ are alternately simulated by normal
and log-normal random walk proposals and accepted by a one-dimensional
Metropolis-Hastings step.

While Example 10.21 was concerned with convergence difficulties with the
Gibbs sampler, this (different) implementation does not seem to face the same
problem, at least judging from a simple examination of Figure 12.1, since
the simulated values coincide with the highest region of the log-posterior.
Obviously, this is a very crude evaluation and a more refined assessment is
necessary before deciding whether to use the MCMC sample represented in
Figure 12.1 in ergodic averages, or to increase the number of simulations to
achieve a more reliable approximation. ||
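
The sampler of Example 12.1 can be sketched as follows. This is only a minimal
illustration under stated assumptions: the data are simulated here (with
$\beta/\sigma = 0.89$ and standard normal covariates), the random walk scales
`step_beta` and `step_log` and the starting point are arbitrary choices of ours,
and the prior is the special case written above. Since the log-normal random
walk on $\sigma^2$ is not symmetric, the acceptance ratio for that step carries
the extra factor (proposed value)/(current value).

```python
import math
import random


def norm_cdf(z):
    """Standard normal cdf Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def log_posterior(beta, sigma2, data):
    """Log of prior times probit likelihood, up to an additive constant."""
    sigma = math.sqrt(sigma2)
    lp = -1.0 / sigma2 - beta * beta / 50.0   # log-prior of the special case
    for r, d in data:
        p = norm_cdf(r * beta / sigma)
        p = min(max(p, 1e-300), 1.0 - 1e-16)  # keep both logarithms finite
        lp += math.log(p) if d == 1 else math.log1p(-p)
    return lp


def simulate_data(n, ratio, rng):
    """Hypothetical dataset: r_i ~ N(0,1), d_i ~ Bernoulli(Phi(ratio * r_i))."""
    data = []
    for _ in range(n):
        r = rng.gauss(0.0, 1.0)
        data.append((r, 1 if rng.random() < norm_cdf(ratio * r) else 0))
    return data


def metropolis_within_gibbs(data, n_iter, beta=1.0, sigma2=2.0,
                            step_beta=0.5, step_log=0.5, seed=1):
    """Alternate one-dimensional Metropolis-Hastings updates of beta and sigma2."""
    rng = random.Random(seed)
    lp = log_posterior(beta, sigma2, data)
    chain = []
    for _ in range(n_iter):
        # Normal random walk on beta (symmetric proposal).
        prop = beta + rng.gauss(0.0, step_beta)
        lp_prop = log_posterior(prop, sigma2, data)
        if math.log(1.0 - rng.random()) < lp_prop - lp:
            beta, lp = prop, lp_prop
        # Log-normal random walk on sigma2, with its Hastings correction.
        prop = sigma2 * math.exp(rng.gauss(0.0, step_log))
        if 0.0 < prop < float("inf"):
            lp_prop = log_posterior(beta, prop, data)
            if math.log(1.0 - rng.random()) < lp_prop - lp + math.log(prop / sigma2):
                sigma2, lp = prop, lp_prop
        chain.append((beta, sigma2))
    return chain


if __name__ == "__main__":
    data = simulate_data(200, 0.89, random.Random(0))
    chain = metropolis_within_gibbs(data, 2000)
    tail = chain[-500:]
    print(sum(b / math.sqrt(s2) for b, s2 in tail) / len(tail))
```

Printing the average of $\beta/\sigma$ rather than of $\beta$ alone is deliberate: only
the ratio is identified by the likelihood, so it is the quantity one can expect
to stabilize on a short run.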

The goal of this chapter is to present, in varying amounts of detail, a 
catalog of the numerous monitoring methods (or diagnostics) proposed in the 
literature, in connection with the review papers of Cowles and Carlin (1996), 
Robert (1995a), Brooks (1998b), Brooks and Roberts (1998), and Mengersen 
et al. (1999). Some of the techniques presented in this chapter have withstood
the test of time, and others are somewhat exploratory in nature. We are,
however, in the situation of describing a sequence of noncomparable techniques
with widely varying degrees of theoretical justification and usefulness.¹



12.1.1 Convergence Criteria 

From a general point of view, there are three (increasingly stringent) types of
convergence for which assessment is necessary: 



(i) Convergence to the Stationary Distribution 

This criterion considers convergence of the chain $(\theta^{(t)})$ to the stationary
distribution $f$ (or stationarization), which seems to be a minimal requirement
for an algorithm that is supposed to approximate simulation from $f$!
Unfortunately, it seems that this approach to convergence issues is not
particularly fruitful. In fact, from a theoretical point of view, $f$ is only the
limiting distribution of $\theta^{(t)}$. This means that stationarity is only achieved
asymptotically.²

However, the original implementation of the Gibbs sampler was based on
the generation of $n$ independent initial values $\theta_i^{(0)}$ ($i = 1, \ldots, n$), and the
storage of only the last simulation $\theta_i^{(T)}$ in each chain. While intuitive to
some extent (the larger $T$ is, the closer $\theta_i^{(T)}$ is to the stationary
distribution), this criterion is missing the point. If $\mu_0$ is the (initial)
distribution of $\theta^{(0)}$, then the $\theta_i^{(T)}$’s are all distributed from $\mu_0 K^T$, the
distribution of the chain after $T$ steps, rather than from $f$. In addition,
this also results in a waste of resources, as most of the generated variables
are discarded.

If we, instead, consider only a single realization (or path) of the chain
$(\theta^{(t)})$, the question of convergence to the limiting distribution is not
really relevant! Indeed, it is possible³ to obtain the initial value $\theta^{(0)}$ from
the distribution $f$, and therefore to act as if the chain is already in its
stationary regime from the start, meaning that $\theta^{(0)}$ belongs to an area of
likely (enough) values for $f$.

This seeming dismissal of the first type of control may appear rather
cavalier, but we do think that convergence to $f$ per se is not the major issue
for most MCMC algorithms in the sense that the chain truly produced by the
algorithm often behaves like a chain initialized from $f$. The issues at stake
are rather the speed of exploration of the support of $f$ and the degree of
correlation between the $\theta^{(t)}$’s. This is not to say that stationarity should
not be tested at all. As we will see in Section 12.2.2, regardless of the
starting distribution, the chain may be slow to explore the different regions
of the support of $f$, with lengthy stays in some regions (for example, the modes
of the distribution $f$). A stationarity test may be useful in detecting such
difficulties.

¹ Historically, there was a flurry of papers at the end of the 90s concerned with
the development of convergence diagnoses. This flurry has now quieted down, the
main reason being that no criterion is absolutely foolproof, as we will see later
in this chapter. As we will see again in the introduction of Chapter 13, the only
way of being certain the algorithm has converged is to use iid sampling!

² This perspective obviously oversimplifies the issue. As already seen in the case
of renewal and coupling, there exist finite instances where the chain is known to
be in the stationary distribution (see Section 12.2.3).

³ We consider a standard statistical setup where the support of $f$ is approximately
known. This may not be the case for high-dimensional setups or complex
structures where the algorithm is initialized at random.



(ii) Convergence of Averages 

Here, as in regular Monte Carlo settings, we are concerned with convergence
of the empirical average

(12.1) $\frac{1}{T} \sum_{t=1}^{T} h(\theta^{(t)})$

to $\mathbb{E}_f[h(\theta)]$ for an arbitrary function $h$. This type of convergence is most
relevant in the implementation of MCMC algorithms. Indeed, even when
$\theta^{(0)} \sim f$, the exploration of the complexity of $f$ by the chain $(\theta^{(t)})$ can be
more or less lengthy, depending on the transition chosen for the algorithm. The
purpose of the convergence assessment is, therefore, to determine whether the
chain has exhibited all the features of $f$ (for instance, all the modes). Brooks
and Roberts (1998) relate this convergence to the mixing speed of the chain, in
the informal sense of a strong (or weak) dependence on initial conditions and
of a slow (or fast) exploration of the support of $f$ (see also Asmussen et al.
1992). A formal version of convergence monitoring in this setup is the
convergence assessment of Section 12.1. While the ergodic theorem guarantees
the convergence of this average from a theoretical point of view, the relevant
issue at this stage is to determine a minimal value for $T$ which justifies the
approximation of $\mathbb{E}_f[h(\theta)]$ by (12.1) for a given level of accuracy.



(iii) Convergence to iid Sampling

This convergence criterion measures how close a sample $(\theta_1^{(t)}, \ldots, \theta_n^{(t)})$ is to
being iid. Rather than approximating integrals such as $\mathbb{E}_f[h(\theta)]$, the goal is
to produce variables $\theta_i$ which are (quasi-)independent. While the solution
based on parallel chains mentioned above is not satisfactory, an alternative
is to use subsampling (or batch sampling) to reduce correlation between the
successive points of the Markov chain. This technique, which is customarily
used in numerical simulation (see, for instance, Schmeiser 1989), subsamples
the chain $(\theta^{(t)})$ with a batch size $k$, considering only the values
$\eta^{(t)} = \theta^{(kt)}$. If the covariance $\mathrm{cov}_f(\theta^{(0)}, \theta^{(t)})$ decreases monotonically
with $t$ (see Section 9.3), the motivation for subsampling is obvious. In
particular, if the chain $(\theta^{(t)})$ satisfies an interleaving property (see Section
9.2.2), subsampling is justified. However, checking for the monotone decrease of
$\mathrm{cov}_f(\theta^{(0)}, \theta^{(t)})$, which also justifies Rao-Blackwellization (see Section 9.3),
is not always possible and, in some settings, the covariance oscillates with
$t$, which complicates the choice of $k$.

We note that subsampling necessarily leads to losses in efficiency with
regard to the second convergence goal. In fact, as shown by MacEachern and
Berliner (1994), it is always preferable to use the entire sample for the
approximation of $\mathbb{E}_f[h(\theta)]$. Nonetheless, for convergence assessment,
subsampling may be beneficial (see, e.g., Robert et al. 1999).

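Lemma 12.2. Suppose $h \in \mathcal{L}^2(f)$ and $(\theta^{(t)})$ is a Markov chain with stationary
distribution $f$. If

$$\delta_1 = \frac{1}{Tk} \sum_{t=1}^{Tk} h(\theta^{(t)}) \qquad \text{and} \qquad \delta_k = \frac{1}{T} \sum_{\ell=1}^{T} h(\theta^{(k\ell)}),$$

the variance of $\delta_1$ satisfies

$$\mathrm{var}(\delta_1) \le \mathrm{var}(\delta_k),$$

for every $k > 1$.

Proof. Define $\delta_k^1, \ldots, \delta_k^{k-1}$ as the shifted versions of $\delta_k = \delta_k^0$; that is,

$$\delta_k^i = \frac{1}{T} \sum_{t=1}^{T} h(\theta^{(tk - i)}), \qquad i = 0, 1, \ldots, k-1.$$

The estimator $\delta_1$ can then be written as $\delta_1 = \frac{1}{k} \sum_{i=0}^{k-1} \delta_k^i$, hence

$$\begin{aligned} \mathrm{var}(\delta_1) &= \mathrm{var}\left( \frac{1}{k} \sum_{i=0}^{k-1} \delta_k^i \right) \\ &= \mathrm{var}(\delta_k^0)/k + \sum_{i \ne j} \mathrm{cov}(\delta_k^i, \delta_k^j)/k^2 \\ &\le \mathrm{var}(\delta_k^0)/k + \sum_{i \ne j} \mathrm{var}(\delta_k^0)/k^2 \\ &= \mathrm{var}(\delta_k), \end{aligned}$$

where the inequality follows from the Cauchy-Schwarz inequality

$$|\mathrm{cov}(\delta_k^i, \delta_k^j)| \le \mathrm{var}(\delta_k^0).$$

□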
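
As a numerical check of Lemma 12.2, the sketch below computes the two variances
exactly for a stationary Gaussian AR(1) chain with $h(\theta) = \theta$, using its known
autocovariance $\rho^{|t|}/(1 - \rho^2)$. The AR(1) chain, the unit innovation variance,
and the function names are assumptions made for illustration only; no
simulation is involved, so the comparison is free of Monte Carlo noise.

```python
def ar1_autocov(lag, rho):
    """Autocovariance at a given lag of a stationary AR(1) chain
    theta_t = rho * theta_{t-1} + N(0, 1) innovations, with |rho| < 1."""
    return rho ** abs(lag) / (1.0 - rho ** 2)


def var_of_average(n_points, spacing, rho):
    """Exact variance of the average of n_points values of the chain
    taken every `spacing` iterations."""
    total = n_points * ar1_autocov(0, rho)
    for d in range(1, n_points):
        # There are (n_points - d) ordered pairs at distance d, counted twice.
        total += 2.0 * (n_points - d) * ar1_autocov(spacing * d, rho)
    return total / n_points ** 2


def compare_variances(T, k, rho):
    """Return (var(delta_1), var(delta_k)) in the notation of Lemma 12.2."""
    var_full = var_of_average(T * k, 1, rho)  # all T*k values
    var_sub = var_of_average(T, k, rho)       # every k-th value only
    return var_full, var_sub


if __name__ == "__main__":
    for rho in (0.5, 0.9, 0.99):
        print(rho, compare_variances(100, 10, rho))
```

For strongly correlated chains the two variances are close, which shows that
subsampling loses little there, but by the lemma it never gains anything for
the estimation of $\mathbb{E}_f[h(\theta)]$.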



In the remainder of the chapter, we consider only independence issues 
in cases where they have bearing on the control of the chain, as in renewal 
theory (see Section 12.2.3). Indeed, for an overwhelming majority of cases 
where MCMC algorithms are used, independence is not a necessary feature. 
12.1.2 Multiple Chains

Aside from distinguishing between convergence to stationarity (Section 12.2)
and convergence of the average (Section 12.3), we also distinguish between
the methods involving the parallel simulation of $M$ independent chains
$(\theta_m^{(t)})$ ($1 \le m \le M$) and those based on a single “on-line” chain. The
motivation of the former is intuitively sound. By simulating several chains,
variability and dependence on the initial values are reduced and it should be
easier to control convergence to the stationary distribution by comparing the
estimation, using different chains, of quantities of interest. The dangers of a
naive implementation of this principle should be obvious, namely that the
slower chain governs convergence and that the choice of the initial distribution
is extremely important in guaranteeing that the different chains are well
dispersed.

Many multiple-chain convergence diagnostics are quite elaborate (Gelman
and Rubin 1992, Liu et al. 1992) and seem to propose convergence evaluations
that are more robust than single-chain methods. Geyer (1992) points out that
this robustness may be illusory from several points of view. In fact, good
performances of these parallel methods require a degree of a priori knowledge
on the distribution $f$ in order to construct an initial distribution on $\theta_m^{(0)}$
which takes into account the features of $f$ (modes, shape of high density
regions, etc.). For example, an initial distribution which is too concentrated
around a local mode of $f$ does not contribute significantly more than a single
chain to the exploration of $f$. Moreover, slow algorithms, like Gibbs sampling
used in highly nonlinear setups, usually favor single chains, in the sense that
a unique chain with $MT$ observations and a slow rate of mixing is more likely
to get closer to the stationary distribution than $M$ chains of size $T$, which
will presumably stay in the neighborhood of the starting point with higher
probability.

An additional practical drawback of parallel methods is that they require
a modification of the original MCMC algorithm to deal with the processing
of parallel outputs. (See Tierney 1994 and Raftery and Lewis 1996 for other
criticisms.) On the other hand, single-chain methods suffer more severely from
the defect that “you’ve only seen where you’ve been,” in the sense that the
part of the support of $f$ which has not been visited by the chain at time $T$ is
almost impossible to detect. Moreover, a single chain may present probabilistic
pathologies which are possibly avoided by parallel chains. (See the example
of Figure 12.16, as opposed to the sequential importance sampling resolution
of Section 14.4.4.)

As discussed in Chapter 14, and in particular in the population Monte
Carlo section, Section 14.4, there exist alternative ways of taking advantage
of parallel chains to improve and study convergence, in particular for a better
assessment of the entire support of $f$. These algorithms outside the MCMC
area can also be used as benchmarks to test the convergence of MCMC
algorithms.







12.1.3 Monitoring Reconsidered 

We agree with many authors⁴ that it is somewhat of an illusion to think we can
control the flow of a Markov chain and assess its convergence behavior from
a few realizations of this chain. There always are settings (transition kernels)
which, for most realizations, will invalidate an arbitrary indicator (whatever
its theoretical justification) and the randomness inherent to the nature of the
problem prevents any categorical guarantee of performance. The heart of the
difficulty is the key problem of statistics, where the uncertainty due to the
observations prohibits categorical conclusions and final statements. Far from
being a failure acknowledgment, these remarks only aim at warning the reader
about the relative value of the indicators developed below. As noted by Cowles
and Carlin (1996), it is simply inconceivable, in the light of recent results, to
envision automated stopping rules. Brooks and Roberts (1998) also stress that
the prevalence of a given control method strongly depends on the model and
on the inferential problem under study. It is, therefore, even more crucial to
develop robust and general evaluation methods which extend and complement
the present battery of stopping criteria. One goal is the development of
“convergence diagnostic spreadsheets,” in the sense of computer graphical
outputs, which would graph several different features of the convergence
properties of the chain under study (see Cowles and Carlin 1996, Best et al.
1995, Robert 1997, 1998, or Robert et al. 1999).

The criticisms presented in the wake of the techniques proposed below 
serve to highlight the incomplete aspect of each method. They do not aim at 
preventing their use but rather to warn against a selective interpretation of 
their results. 



12.2 Monitoring Convergence to the Stationary 
Distribution 

12.2.1 A First Illustration 

As a first approach to assessing how close the Markov chain is to stationarity,
one might try to obtain a bound on $\|K^n(x, \cdot) - f\|_{TV}$, the total variation
difference between the $n$th step transition kernel and the stationary
distribution (see Section 6.6). For example, in the case of the independent
Metropolis-Hastings algorithm, we know that $\|K^n(x, \cdot) - f\|_{TV} \le 2(1 - M^{-1})^n$
(Theorem 7.8). Using an argument based on drift conditions (Note 6.9.1), Jones
and Hobert (2001) were able to obtain bounds on a number of geometrically
ergodic chains. However, although their results apply to a wide class of
Markov chains, the calculations needed to obtain the analytical bounds can
be prohibitively difficult.

⁴ To borrow from the injunction of Hastings (1970), “even the simplest of numerical
methods may yield spurious results if insufficient care is taken in their use. . . The
setting is certainly no better for the Markov chain methods and they should be
used with appropriate caution.”

A natural empirical approach to convergence control is to draw pictures of
the output of simulated chains, in order to detect deviant or non-stationary
behaviors. For instance, as in Gelfand and Smith (1990), a first plot is to draw
the sequence of the $\theta^{(t)}$’s against $t$. However, this plot is only useful for
strong non-stationarities of the chain.

Example 12.3. (Continuation of Example 12.1) For the MCMC sample
produced in Example 12.1, simple graphs may be enough to detect a lack of
convergence. For instance, Figures 12.2 and 12.3 represent the evolution of
$\beta^{(t)}$ along 1,000 and 10,000 iterations of the Gibbs sampler. They clearly show
that the time scale required for the convergence of the MCMC algorithm is much
bigger than these quantities. It is only at the scale of 100,000 iterations
(Figure 12.4) that the chain seems to take a more likely turn towards
stationarity. But things are not so clear-cut when looking at the last
simulated values: Figure 12.5 presents the same heterogeneities as Figure 12.3.
This indicates that the Markov chain, rather than being slow to converge to the
stationary distribution, is slow to explore the support of the posterior
distribution; that is, it is mixing very slowly. This feature can also be
checked on Figure 12.6, which indicates the values of the log-posterior at
consecutive points of the Markov chain.

Another interesting feature, related to this convergence issue, is that some
aspects of the posterior may be captured more quickly than others, reinforcing
our point that the fundamental problem is mixing rather than convergence per
se. When looking at the marginal posterior distribution of $\beta/\sigma$, the
stabilization of the empirical distribution is much faster, as shown by the
comparison of the first and of the last 5,000 iterations (out of 100,000) in
Figure 12.7. This is due to the fact that $\beta/\sigma$ is an identifiable parameter for
the probit model $P(D = 1 \mid R) = \Phi(R\beta/\sigma)$. (For the simulated dataset, the true
value of $\beta/\sigma$ is 0.89, which agrees quite well with the MCMC sample being
distributed around 0.9.) ||



12.2.2 Nonparametric Tests of Stationarity 

Standard nonparametric tests, such as Kolmogorov-Smirnov or Kuiper tests
(Lehmann 1975), can be applied in the stationarity assessment of a single
output of the chain $(\theta^{(t)})$. In fact, when the chain is stationary, $\theta^{(t_1)}$ and
$\theta^{(t_2)}$ have the same marginal distribution for arbitrary times $t_1$ and $t_2$. Given
an MCMC sample $\theta^{(1)}, \ldots, \theta^{(T)}$, it thus makes sense to compare the
distributions of the two halves of this sample, $(\theta^{(1)}, \ldots, \theta^{(T/2)})$ and
$(\theta^{(T/2+1)}, \ldots, \theta^{(T)})$.

Since usual nonparametric tests are devised and calibrated in terms of iid
samples, there needs to be a correction for the correlation between the
$\theta^{(t)}$’s.
Fig. 12.2. Evolution of the Markov chain $(\beta^{(t)})$ along the first 1,000
iterations when the chain is started at (1,2).

Fig. 12.3. Evolution of the Markov chain $(\beta^{(t)})$ along the first 10,000
iterations.

Fig. 12.5. Evolution of the Markov chain $(\beta^{(t)})$ along the last 10,000
iterations out of 100,000.

Fig. 12.6. Evolution of the value of the log-posterior along 100,000 iterations.



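This correction can be achieved by the introduction of a batch size $G$
leading to the construction of two (quasi-)independent samples. For each of
the two halves above, select subsamples $(\theta_1^{(G)}, \theta_1^{(2G)}, \ldots)$ and
$(\theta_2^{(G)}, \theta_2^{(2G)}, \ldots)$. Then, for example, the Kolmogorov-Smirnov statistic is

(12.2) $K = \frac{1}{M} \sup_{\eta} \left| \sum_{g=1}^{M} \mathbb{I}_{(-\infty, \eta)}(\theta_1^{(gG)}) - \sum_{g=1}^{M} \mathbb{I}_{(-\infty, \eta)}(\theta_2^{(gG)}) \right|$

in the case of a one-dimensional chain. For multidimensional chains, (12.2)
can be computed on either a function of interest or on each component of the
vector $\theta^{(t)}$.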

The statistic $K$ can be processed in several ways to derive a stopping rule.
First, under the stationarity assumption as $M$ goes to infinity, the limiting
distribution of $\sqrt{M}\,K$ has the cdf

(12.3) $R(x) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2x^2}$,

which can be easily approximated by a finite sum (see Problem 12.2). The
corresponding p-value can therefore be computed for each $T$ until it gets
above a given level. (An approximation of the 95% quantile, 1.36 (for
$M > 100$), simplifies this stage.) Of course, to produce a valid inference, we
must take into account the sequential nature of the test and the fact that $K$
is computed as the infimum over all components of $\theta^{(t)}$ of the corresponding
values of (12.2).

Fig. 12.7. Comparison of the distribution of the sequence of $\beta^{(t)}/\sigma^{(t)}$ over the
first 5,000 iterations (top) and the last 5,000 iterations (bottom), along 100,000
iterations.

An exact derivation of the level of the derived test is quite difficult given the
correlation between the $\theta^{(t)}$’s and the influence of the subsampling mechanism.
Another use of (12.2), which is more graphic, is to represent the sample of
$\sqrt{M}\,K_T$’s against $T$ and to check visually for a stable distribution around
small values. (See also Brooks et al. 2003a, for another approach using the
Kolmogorov-Smirnov test.)
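
The steps just described (split the chain in two halves, subsample each half
with batch size $G$, compute the Kolmogorov-Smirnov distance) can be sketched as
follows. The function names are ours, the chain is assumed to be a plain list
of one-dimensional values, and `kolmogorov_cdf` is the classical Kolmogorov
limiting cdf truncated to a finite number of terms. The function returns the
unscaled statistic $K$; the scaling applied before comparing with 1.36 is left to
the caller, as in the text.

```python
import math


def subsample(chain, G):
    """Keep every G-th value of a list: chain[G-1], chain[2G-1], ..."""
    return chain[G - 1::G]


def ks_statistic(sample1, sample2):
    """Two-sample Kolmogorov-Smirnov distance sup |F1 - F2| between the
    empirical cdfs of two non-empty samples."""
    xs, ys = sorted(sample1), sorted(sample2)
    n, m = len(xs), len(ys)
    i = j = 0
    dist = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:   # advance past all values equal to v
            i += 1
        while j < m and ys[j] == v:
            j += 1
        dist = max(dist, abs(i / n - j / m))
    return dist


def kolmogorov_cdf(x, terms=100):
    """Finite-sum approximation of 1 - 2 * sum_k (-1)^(k-1) exp(-2 k^2 x^2).
    Adequate when x is not very close to 0."""
    if x <= 0.0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s


def stationarity_statistic(chain, G):
    """Return (M, K): common subsample size and KS distance between the
    subsampled first and second halves of the chain."""
    half = len(chain) // 2
    first = subsample(chain[:half], G)
    second = subsample(chain[half:], G)
    M = min(len(first), len(second))
    return M, ks_statistic(first[:M], second[:M])


if __name__ == "__main__":
    drifting = [0.01 * t for t in range(2000)]   # an obviously non-stationary path
    M, K = stationarity_statistic(drifting, G=10)
    print(M, K, math.sqrt(M) * K)
```

A steadily drifting path gives $K = 1$, the largest possible value; a well-mixing
stationary chain should instead produce small values once $G$ is large enough to
remove most of the correlation.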
Fig. 12.8. Plot of 100 Kolmogorov-Smirnov statistics for T = 1,000 and T = 10,000
iterations. The dotted line corresponds to the 95% level.



Obviously, an assessment of stationarity based on a single chain is open to
criticism: In cases of strong attraction from a local mode, the chain will most
likely behave as if it was simulated from the restriction of $f$ to the
neighborhood of this mode and thus lead to a diagnosis of convergence (this is
the “you’ve only seen where you’ve been” defect mentioned in Section 12.1.2).
However, in more intermediate cases, where the chain $(\theta^{(t)})$ stays for a while
in the neighborhood of a mode before visiting another modal region, the
subsamples $(\theta_1^{(gG)})$ and $(\theta_2^{(gG)})$ should exhibit different features until the
chain explores every modal region.⁵

Example 12.4. Nuclear pump failures. In the model of Gaver and
O’Muircheartaigh (1987), described in Example 10.17, we consider the
subchain $(\beta^{(t)})$ produced by the algorithm. Figure 12.8 gives the values of the
Kolmogorov-Smirnov statistics $K$ for $T = 1,000$ and $T = 10,000$ iterations,
with $M = 100$ and 100 values of $10\,K_T$. Although both cases lead to similar
proportions of about 95% values under the level 1.36, the first case clearly
lacks the required homogeneity, since the statistic is almost monotone in $t$.
This behavior may correspond to the local exploration of a modal region for
$t < 400$ and to the move to another region of importance for $400 < t < 800$. ||



12.2.3 Renewal Methods 

Chapter 6 has presented the basis of renewal theory in Section 6.3.2 through
the notion of small set (Definition 6.19). This theory is then used in Section
6.7 for a direct derivation of many limit theorems (Theorems 6.63 and 6.64).
It is also possible to take advantage of this theory for the purpose(s) of
convergence control, either through small sets as in Robert (1995a) or through
the alternative of Mykland et al. (1995), which is based on a less restrictive
representation of renewal called regeneration. (See also Gilks et al. 1998, Sahu
and Zhigljavsky 1998, 2003 for a related updating technique.) The use of small
sets is obviously relevant in the control of convergence of averages, as
presented in Section 12.3.3, and as a correct discretization technique (see
Section 12.4.2), but it is above all a central notion for the convergence to
stationarity.

⁵ We may point out that, once more, it is a matter of proper scaling of the algorithm.

As we will see in Chapter 13, renewal is a key to the derivation of perfect 
sampling schemes (see Section 13.2.4), in which we can obtain exact samples 
from the stationary distribution. This is done using Kac’s representation of 
stationary distributions (Section 6.5.2). The difficulties with minorization con- 
ditions, namely the derivation of the minorizing measure and the very small 
probabilities of renewal, remain, however, and restrict the practicality of this 
approach for stationarity assessment. 

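Mykland et al. (1995) replaced small sets and the corresponding
minorization condition with a generalization through functions $s$ such that

(12.4) $K(\theta, B) \ge s(\theta)\, \nu(B), \qquad \theta \in \Theta, \; B \in \mathcal{B}(\Theta)$.

Small sets thus correspond to the particular case $s(x) = \varepsilon\, \mathbb{I}_C(x)$. If we define

$$r(\theta^{(t)}, \xi) = \frac{s(\theta^{(t)})\, \nu(\xi)}{K(\theta^{(t)}, \xi)},$$

then each time $t$ is a renewal time with probability $r(\theta^{(t-1)}, \theta^{(t)})$.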

Based on the regeneration rate, Mykland et al. (1995) propose a graphical
convergence assessment. When an MCMC algorithm is in its stationary
regime and has correctly explored the support of $f$, the regeneration rate must
remain approximately constant (because of stationarity). Thus, they plot an
approximation of the regeneration rate,

$$\frac{1}{T} \sum_{t=1}^{T} r(\theta^{(t-1)}, \theta^{(t)}),$$

against $T$. When the normalizing constant of $\nu$ is available they instead plot

$$\frac{1}{T} \sum_{t=1}^{T} s(\theta^{(t)}).$$

An additional recommendation of the authors is to smooth the graph of the
estimated regeneration rate by nonparametric techniques, but the influence of
the smoothing parameter must be assessed to avoid overoptimistic conclusions
(see Section 12.6.1 for a global criticism of functional nonparametric
techniques).

Metropolis-Hastings algorithms provide a natural setting for the
implementation of this regeneration method, since they allow for complete
freedom in the choice of the transition kernel. Problems related to the Dirac
mass in the kernel can be eliminated by considering only its absolutely
continuous part,

$$\rho(\theta^{(t)}, \xi)\, q(\xi \mid \theta^{(t)}) = \min\left\{ \frac{f(\xi)}{f(\theta^{(t)})}\, q(\theta^{(t)} \mid \xi), \; q(\xi \mid \theta^{(t)}) \right\},$$

using the notation of [A.24]. The determination of $s$ and $\nu$ is facilitated in
the case of a so-called pseudo-reversible transition; that is, when there exists
a positive function $\tilde f$ with

(12.5) $\tilde f(\theta)\, q(\xi \mid \theta) = \tilde f(\xi)\, q(\theta \mid \xi)$.

Equation (12.5) thus looks like a detailed balance condition for a reversible
Markov chain, but $\tilde f$ is not necessarily a probability density.

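Lemma 12.5. Suppose that $q$ satisfies (12.5) and for the transition induced
by $q$, the functions $s_q$ and $\nu_q$ satisfy (12.4). Let $w(\theta) = f(\theta)/\tilde f(\theta)$ and for
every $c > 0$, define the function

$$s(\theta) = s_q(\theta)\, \min\left\{ \frac{c}{w(\theta)}, 1 \right\}$$

and the density

$$\nu(\theta) = \nu_q(\theta)\, \min\left\{ \frac{w(\theta)}{c}, 1 \right\}.$$

Then, for the Metropolis-Hastings algorithm associated with $q$, $s(\theta)$ and $\nu(\theta)$
satisfy (12.4).

Proof. Since

$$P(\theta^{(t+1)} \in B \mid \theta^{(t)}) \ge \int_B \rho(\theta^{(t)}, \xi)\, q(\xi \mid \theta^{(t)})\, d\xi,$$

we have

$$\begin{aligned} P(\theta^{(t+1)} \in B \mid \theta^{(t)}) &\ge \int_B \min\left\{ \frac{w(\xi)}{w(\theta^{(t)})}, 1 \right\} q(\xi \mid \theta^{(t)})\, d\xi \\ &\ge \min\left\{ \frac{c}{w(\theta^{(t)})}, 1 \right\} \int_B \min\left\{ \frac{w(\xi)}{c}, 1 \right\} s_q(\theta^{(t)})\, \nu_q(\xi)\, d\xi \\ &= s(\theta^{(t)}) \int_B \nu(\xi)\, d\xi. \end{aligned}$$

□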

In the particular case of an independent Metropolis-Hastings algorithm,
$q(\xi \mid \theta) = g(\xi)$ and (12.5) applies for $\tilde f = g$. Therefore, $s_q \equiv 1$, $\nu_q = g$, and

$$\nu(\theta) \propto g(\theta)\, \min\left\{ \frac{f(\theta)}{c\, g(\theta)}, 1 \right\},$$

which behaves as a truncation of the instrumental distribution $g$ depending
on the true density $f$. If $\xi \sim g$ is accepted, the probability of regeneration is
then

(12.6) $r(\theta^{(t)}, \xi) = \begin{cases} c \,/\, \{ w(\xi) \wedge w(\theta^{(t)}) \} & \text{if } w(\xi) \wedge w(\theta^{(t)}) > c, \\ \{ w(\xi) \vee w(\theta^{(t)}) \} \,/\, c & \text{if } w(\xi) \vee w(\theta^{(t)}) < c, \\ 1 & \text{otherwise.} \end{cases}$

Note that $c$ is a free parameter. Therefore, it can be selected (calibrated) to
optimize the renewal probability (or the mixing rate).
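
The regeneration probability of the independent case is simple enough to be
coded directly. The sketch below implements the three-case formula (12.6) and
plugs it into a toy independent Metropolis-Hastings sampler. The target (a
standard normal), the proposal (a normal with standard deviation 2), the value
of `c`, and all function names are assumptions chosen for illustration; $w$ is
only computed up to a multiplicative constant, so `c` lives on that same
unnormalized scale.

```python
import math
import random


def regeneration_probability(w_current, w_proposed, c):
    """Probability that an accepted move of an independent Metropolis-Hastings
    algorithm is a regeneration, following the three cases of (12.6)."""
    lo = min(w_current, w_proposed)
    hi = max(w_current, w_proposed)
    if lo > c:
        return c / lo
    if hi < c:
        return hi / c
    return 1.0


def weight(x):
    """Unnormalized w = f/g for f = N(0,1) and g = N(0, 2^2)."""
    return math.exp(-3.0 * x * x / 8.0)


def run_independent_mh(n_iter, c, seed=0):
    """Run the toy sampler; return (number of accepted moves, number of
    regenerations)."""
    rng = random.Random(seed)
    theta = rng.gauss(0.0, 2.0)
    accepted = regenerations = 0
    for _ in range(n_iter):
        xi = rng.gauss(0.0, 2.0)                      # proposal xi ~ g
        if rng.random() < min(1.0, weight(xi) / weight(theta)):
            accepted += 1
            # Regeneration is only tested when the proposal is accepted.
            if rng.random() < regeneration_probability(weight(theta), weight(xi), c):
                regenerations += 1
            theta = xi
    return accepted, regenerations


if __name__ == "__main__":
    for c in (0.1, 0.5, 0.9):
        print(c, run_independent_mh(10000, c))
```

Running the loop for a few values of `c` gives a crude way of calibrating this
free parameter: one would retain the value producing the most regenerations.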

For a symmetric Metropolis-Hastings algorithm, $q(\xi \mid \theta) = q(\theta \mid \xi)$ implies
that $\tilde f \equiv 1$ satisfies (12.5). The parameters $s_q$ and $\nu_q$ are then determined by
the choice of a set $D$ and of a value $\tilde\theta$ in the following way:

$$\nu_q(\xi) \propto q(\xi \mid \tilde\theta)\, \mathbb{I}_D(\xi), \qquad s_q(\theta) = \inf_{\xi \in D} \frac{q(\xi \mid \theta)}{q(\xi \mid \tilde\theta)}.$$

The setting is therefore less interesting than in the independent case since $D$
and $\tilde\theta$ have first to be found, as in a regular minorization condition. Note that
the choice of $\tilde\theta$ can be based on an initial simulation of the algorithm, using
the mode or the median of $w(\theta^{(t)})$.



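Example 12.6. (Continuation of Example 12.4) In the model of Gaver
and O’Muircheartaigh (1987), $D = \mathbb{R}_+$ is a small set for the chain $(\beta^{(t)})$,
which is thus uniformly ergodic. Indeed,

$$K(\beta, \beta') \ge \frac{\delta^{\gamma + 10\alpha}\, (\beta')^{\gamma + 10\alpha - 1}}{\Gamma(10\alpha + \gamma)}\, e^{-\delta\beta'} \prod_{i=1}^{10} \left( \frac{t_i}{t_i + \beta'} \right)^{p_i + \alpha}$$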

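(see Example 10.17). The regeneration (or renewal) probability is then

(12.7) $r(\beta, \beta') = \frac{\delta^{\gamma + 10\alpha}\, (\beta')^{\gamma + 10\alpha - 1}}{K(\beta, \beta')\, \Gamma(\gamma + 10\alpha)}\, e^{-\delta\beta'} \prod_{i=1}^{10} \left( \frac{t_i}{t_i + \beta'} \right)^{p_i + \alpha}$,

where $K(\beta, \beta')$ can be approximated by

$$\frac{1}{M} \sum_{m=1}^{M} \frac{\left( \delta + \sum_{i=1}^{10} \lambda_{im} \right)^{\gamma + 10\alpha} (\beta')^{\gamma + 10\alpha - 1}}{\Gamma(\gamma + 10\alpha)} \exp\left\{ -\beta' \left( \delta + \sum_{i=1}^{10} \lambda_{im} \right) \right\},$$

with $\lambda_{im} \sim \mathcal{G}a(p_i + \alpha, t_i + \beta)$. Figure 12.9 provides the plot of
$r(\beta^{(t)}, \beta^{(t+1)})$ against $t$, and the graph of the averages of these
probabilities, based on only the first four observations (pumps) in the dataset.
(This dataset illustrates the problem with perfect sampling mentioned at the
beginning of this section. The renewal probabilities associated with 10
observations are quite small, resulting in the product in (12.7) being very
small, and limiting the practical use.) Mykland et al. (1995) thus replace $\mathbb{R}_+$
with the interval $D = [2.3, 3.1]$ to achieve renewal probabilities of reasonable
magnitude. ||

Fig. 12.9. Plot of the probabilities (12.7) for the pump failure data when
$\mathbb{R}_+$ is chosen as the small set. Superimposed (and scaled to the right) is
their running average.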

Although renewal methods involve a detailed study of the chain $(\theta^{(t)})$ and may require modifications of the algorithms, as in the hybrid algorithms of Mykland et al. (1995) (see also Section 12.3.3), their basic independence structure brings a certain amount of robustness in the monitoring of MCMC algorithms, more than the other approaches presented in this chapter. The specificity of the monitoring obviously prevents a completely automated implementation, but this drawback cannot detract from its attractive theoretical properties, since it is one of the very few completely justified monitoring methods.

12.2.4 Missing Mass 

Another approach to convergence monitoring is to assess how much of the 
support of the target distribution has been explored by the chain via an eval- 
uation of 

$$ \int_A f(x)\, dx, \tag{12.8} $$

if A denotes the support of the distribution of the chain (after a given number 
of iterations). This is not necessarily easy, especially in large dimensions, but 
Philippe and Robert (2001) propose a solution, based on Riemann sums, that 
operates in low dimensions.

Indeed, the Riemann approximation method of Section 4.3 provides a simple convergence diagnosis since, when $f$ is a one-dimensional density, the quantity

$$ \sum_{t=1}^{T-1} \left[ \theta_{(t+1)} - \theta_{(t)} \right] f(\theta_{(t)}) \tag{12.9} $$

converges to 1, even when the $\theta^{(t)}$'s are not generated from the density $f$. (See Definition 4.8 and Proposition 4.9.) Therefore, if the chain $(\theta^{(t)})$ has failed to explore some (significant) part of the support of $f$, the approximation (12.9) gives a signal of non-convergence by providing an evaluation of the mass of the region explored by the chain thus far.
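The criterion (12.9) is simple to compute once the chain is sorted. The sketch below is one possible implementation, not the authors' code; the function name `missing_mass`, the standard normal target, and the two artificial uniform samples are assumptions chosen only to show that the sum approximates the mass of the region covered by the points, whatever distribution they come from.

```python
import numpy as np


def missing_mass(chain, density):
    """Riemann sum (12.9): sum of [theta_(t+1) - theta_(t)] * f(theta_(t)) over the ordered chain."""
    theta = np.sort(np.asarray(chain, dtype=float))
    return float(np.sum(np.diff(theta) * density(theta[:-1])))


def std_normal(x):
    """Standard normal density, used here as a stand-in target f (an assumption)."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)


rng = np.random.default_rng(0)
full = rng.uniform(-5.0, 5.0, size=5000)   # covers essentially the whole support
half = rng.uniform(0.0, 5.0, size=5000)    # misses the negative half-line

mass_full = missing_mass(full, std_normal)  # close to 1
mass_half = missing_mass(half, std_normal)  # close to 0.5: half of the mass is missing
```

Neither sample is drawn from the target, which illustrates that the criterion measures coverage of the support rather than agreement with $f$.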

Example 12.7. Bimodal target. For illustration purposes, consider the density

$$ f(x) = \frac{\exp(-x^2/2)}{\sqrt{2\pi}}\; \frac{4(x - .3)^2 + .01}{4(1 + (.3)^2) + .01}. $$

As shown by Figure 12.10, this distribution is bimodal and a normal random walk Metropolis-Hastings algorithm with a small variance like .04 may face difficulties moving between the two modes. As shown in Figure 12.11, the criterion (12.9) identifies the missing mass corresponding to the second mode when the chain has not yet visited that mode. Note that once the chain enters the second modal region, that is, after the 800th iteration, the criterion (12.9) converges very quickly to 1. ||
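The behavior described in this example can be explored with a short simulation. The following is a sketch under stated assumptions: the starting value, the seed, the chain length, and the checkpoints are arbitrary choices, so the iteration at which the second mode is reached will differ from the run reported in the text.

```python
import numpy as np


def target(x):
    """Bimodal density: standard normal kernel times (4(x - .3)^2 + .01), normalized."""
    norm = np.sqrt(2.0 * np.pi) * (4.0 * (1.0 + 0.3 ** 2) + 0.01)
    return np.exp(-x ** 2 / 2.0) * (4.0 * (x - 0.3) ** 2 + 0.01) / norm


def random_walk_mh(n_iter, x0, variance, rng):
    """Normal random walk Metropolis-Hastings with the given proposal variance."""
    chain = np.empty(n_iter)
    x = x0
    for t in range(n_iter):
        y = x + np.sqrt(variance) * rng.standard_normal()
        if rng.uniform() < target(y) / target(x):   # symmetric proposal: ratio of targets
            x = y
        chain[t] = x
    return chain


def riemann_mass(chain):
    """Criterion (12.9) evaluated on the part of the chain observed so far."""
    theta = np.sort(chain)
    return float(np.sum(np.diff(theta) * target(theta[:-1])))


rng = np.random.default_rng(0)
chain = random_walk_mh(5000, x0=-1.5, variance=0.04, rng=rng)
mass_by_time = [riemann_mass(chain[:t]) for t in (500, 1000, 2500, 5000)]
```

Plotting `mass_by_time` (or the criterion at every iteration) against the chain itself gives a display of the same kind as Figure 12.11: the criterion stays well below 1 as long as only one mode has been visited.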




Fig. 12.10. Density $f(x) \propto \exp(-x^2/2)\, (4(x - .3)^2 + .01)$.



In multidimensional settings, it was mentioned in Section 4.3 that Riemann 
sums are no longer efficient estimators. However, (12.9) can still be used as a 
convergence assessment when Rao-Blackwellization estimates (Section 10.4.2) 
can give a convergent approximation to the marginal distributions. In that 
case, for each marginal / of interest, the Rao-Blackwellized estimate is used 
instead in (12.9).

Fig. 12.11. Evolution of the missing mass criterion (12.9) and superposition of the normal random walk Metropolis-Hastings sequence, for the target density of Figure 12.10.



Example 12.8. Bivariate mixture revisited. Recall the case of the bivariate normal mixture of Example 10.18,

$$ (X, Y) \sim p\, \mathcal{N}_2(\mu, \Sigma) + (1 - p)\, \mathcal{N}_2(\nu, \Sigma'), \tag{12.10} $$

where $\mu = (\mu_1, \mu_2)$, $\nu = (\nu_1, \nu_2) \in \mathbb{R}^2$ and the covariance matrices are

$$ \Sigma = \begin{pmatrix} a & c \\ c & b \end{pmatrix}, \qquad \Sigma' = \begin{pmatrix} a' & c' \\ c' & b' \end{pmatrix}. $$

In this case, the conditional distributions are also normal mixtures,

$$ \begin{aligned}
X \mid y &\sim \omega_y\, \mathcal{N}\left( \mu_1 + \frac{(y - \mu_2)c}{b}, \frac{\det \Sigma}{b} \right) + (1 - \omega_y)\, \mathcal{N}\left( \nu_1 + \frac{(y - \nu_2)c'}{b'}, \frac{\det \Sigma'}{b'} \right), \\
Y \mid x &\sim \omega_x\, \mathcal{N}\left( \mu_2 + \frac{(x - \mu_1)c}{a}, \frac{\det \Sigma}{a} \right) + (1 - \omega_x)\, \mathcal{N}\left( \nu_2 + \frac{(x - \nu_1)c'}{a'}, \frac{\det \Sigma'}{a'} \right),
\end{aligned} $$

where

$$ \begin{aligned}
\omega_x &= \frac{p\, a^{-1/2} \exp(-(x - \mu_1)^2/(2a))}{p\, a^{-1/2} \exp(-(x - \mu_1)^2/(2a)) + (1 - p)\, a'^{-1/2} \exp(-(x - \nu_1)^2/(2a'))}, \\
\omega_y &= \frac{p\, b^{-1/2} \exp(-(y - \mu_2)^2/(2b))}{p\, b^{-1/2} \exp(-(y - \mu_2)^2/(2b)) + (1 - p)\, b'^{-1/2} \exp(-(y - \nu_2)^2/(2b'))}.
\end{aligned} $$

They thus provide a straightforward Gibbs sampler, while the marginal distributions of $X$ and $Y$ are again normal mixtures,

$$ \begin{aligned}
X &\sim p\, \mathcal{N}(\mu_1, a) + (1 - p)\, \mathcal{N}(\nu_1, a'), \\
Y &\sim p\, \mathcal{N}(\mu_2, b) + (1 - p)\, \mathcal{N}(\nu_2, b'),
\end{aligned} $$

which can be used in the mass evaluation (12.9).

Fig. 12.12. (top) 2D histogram of the Markov chain of Example 12.8 after 4000, 6000 and 10,000 iterations; (middle) Path of the Markov chain for the first coordinate $x$; (bottom) Control curves for the bivariate mixture model, for the parameters $\mu = (0,0)$, $\nu = (15,15)$, $p = 0.5$, $\Sigma = \Sigma' = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}$. (Source: Philippe and Robert 2001.)

As seen in Example 10.18, when the two components of the normal mixture (12.10) are far apart, the Gibbs sampler may take a large number of iterations to jump from one component to the other. This feature is thus ideal to study the properties of the convergence diagnostic (12.9). As shown by Figure 12.12 (top and middle), for the numerical values $\mu = (0,0)$, $\nu = (15,15)$, $p = 0.5$, $\Sigma = \Sigma' = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}$, the chain takes almost 5,000 iterations to jump to the second component and this is exactly diagnosed by (12.9) (bottom), where the evaluations (12.9) for the marginal distributions of both $X$ and $Y$ converge quickly to $p = 0.5$ at first and then to 1 when the second mode is visited. ||

A related approach is to build a nonparametric approximation $\hat f$ of the distribution of the chain $(\theta^{(t)})$, and to evaluate the integral (12.8) by an importance sampling argument, that is, to compute

$$ J_T = \frac{1}{T} \sum_{t=1}^{T} \frac{f(\theta^{(t)})}{\hat f(\theta^{(t)})}. $$

Obviously, the convergence of $J_T$ to 1 also requires that $f$ is available in closed form, inclusive of the normalizing constant. If not, multiple estimates as in Section 12.3.2 can be used. Again, using nonparametric estimates requires that the dimension of the chain be small enough.

Brooks and Gelman (1998b) propose a related assessment based on the 
score function, whose expectation is zero. 

12.2.5 Distance Evaluations 

Roberts (1992) considers convergence from a functional point of view, as in Schervish and Carlin (1992) (and only for Gibbs sampling). Using the norm induced by $f$ defined in Section 6.6.1, he proposes an unbiased estimator of the distance $\|f_t - f\|$, where $f_t$ denotes the marginal density of the symmetrized chain $(\theta^{(t)})$. The symmetrized chain is obtained by adding to the steps 1., 2., ..., k. of the Gibbs sampler the additional steps k., k-1., ..., 1., as in [A.41]. This device leads to a reversible chain, creating in addition a dual chain $(\tilde\theta^{(t)})$, which is obtained by the inversion of the steps of the Gibbs sampler: Starting with $\theta^{(t)}$, $\tilde\theta^{(t)}$ is generated conditionally on $\theta^{(t)}$ by steps 1., 2., ..., k., then $\theta^{(t+1)}$ is generated conditionally on $\tilde\theta^{(t)}$ by steps k., k-1., ..., 1.

Using $m$ parallel chains $(\theta_\ell^{(t)})$ ($\ell = 1, \ldots, m$) started with the same initial value $\theta^{(0)}$, Roberts (1992) shows that an unbiased estimator of $\|f_t - f\| + 1$ is

/(#) ’ 

where $K_-$ denotes the transition kernel for the steps k., k-1., ..., 1. of [A.39] (see Problem 12.6). Since the distribution $f$ is typically only known up to a multiplicative constant, the limiting value of $J_t$ is unknown. Thus, the diagnostic based on this unbiased estimation of $\|f_t - f\| + 1$ is to evaluate the stabilization of $J_t$ graphically. This method requires both $K_-(\theta, \theta')$ and the normalizing constant of $K_-(\theta, \theta')$ to be known, as in the method of Ritter and Tanner (1992).

An additional diagnostic can be derived from the convergence of 

f 

to the same limit as Jt, which is equal to 1 when the normalizing constant 
is available. Variations and generalizations of this method are considered by 
Roberts (1994), Brooks et al. (1997), and Brooks and Roberts (1998). 

From a theoretical point of view, this method of evaluation of the distance 
to the stationary distribution is quite satisfactory, but it does not exactly meet 
the convergence requirements of point (ii) of Section 12.1 for several reasons: 

(a) It requires parallel runs of several Gibbs sampler algorithms and thus 
results in a loss of efficiency in the execution (see Section 12.4). 

(b) The convergence control is based on $f_t$, the marginal distribution of $(\theta^{(t)})$, which is typically of minor interest for Markov chain Monte Carlo purposes, when a single realization of $f_t$ is observed, and which does not directly relate to the speed of convergence of empirical means.

(c) The computation (or the approximation) of the normalizing constant of $K_-$ can be time-consuming and stabilizing around a mean value does not imply that the chain has explored all the modes of $f$ (see also Section 12.4).

Liu et al. (1992) also evaluate convergence to the stationary distribution through the difference between $f$, the limiting distribution, and $f_t$, the distribution of $\theta^{(t)}$. Their method is based on an expression of the variance of $f_t(\theta)/f(\theta)$ which is close to the $J_t$'s of Roberts (1992).

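Lemma 12.9. Let $\theta_1^-, \theta_2^- \sim f_{t-1}$ be independent, and generate $\theta_1 \sim K(\theta_1^-, \theta_1)$, $\theta_2 \sim K(\theta_2^-, \theta_2)$. Then the quantity

$$ U = \frac{f(\theta_1)\, K(\theta_1^-, \theta_2)}{f(\theta_2)\, K(\theta_1^-, \theta_1)} $$

satisfies

$$ \mathbb{E}[U] = \mathrm{var}_f\left( \frac{f_t(\theta)}{f(\theta)} \right) + 1. $$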

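Proof. The independence of the $\theta_i^-$'s implies that

$$ \begin{aligned}
\mathbb{E}\left[ \frac{f(\theta_1)\, K(\theta_1^-, \theta_2)}{f(\theta_2)\, K(\theta_1^-, \theta_1)} \right]
&= \int \frac{f(\theta_1)}{f(\theta_2)}\, f_t(\theta_2)\, K(\theta_1^-, \theta_2)\, f_{t-1}(\theta_1^-)\, d\theta_1^-\, d\theta_1\, d\theta_2 \\
&= \int \frac{f_t(\theta_2)^2}{f(\theta_2)}\, d\theta_2 \\
&= \int \left( \frac{f_t(\theta_2)}{f(\theta_2)} - 1 \right)^2 f(\theta_2)\, d\theta_2 + 1,
\end{aligned} $$

with

$$ \mathbb{E}_f\left[ \frac{f_t(\theta_2)}{f(\theta_2)} \right] = 1. $$

□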



Given $M$ parallel chains, each iteration $t$ provides $M(M-1)$ values $U^{(i,j,t)}$ ($i \ne j$). These can be processed either graphically or using the method of Gelman and Rubin (1992) (see Section 12.3.4) for $M/2$ independent values $U^{(i,j,t)}$ (that is, using disjoint pairs $(i, j)$). Note that the computation of $U$ does not require the computation of the normalizing constant for the kernel $K$.



12.3 Monitoring Convergence of Averages 



12.3.1 A First Illustration 



We introduce this criterion with an artificial setting where the rate of conver- 
gence can be controlled to be as slow as we wish. 

Example 12.10. Pathological beta generator. As demonstrated in Problem 7.5 (Algorithm [A.30]), a Markov chain $(X^{(t)})$ such that

$$ X^{(t+1)} = \begin{cases}
Y \sim \mathcal{B}e(\alpha + 1, 1) & \text{with probability } x^{(t)}, \\
x^{(t)} & \text{otherwise}
\end{cases} \tag{12.11} $$

is associated with the stationary distribution

$$ f(x) \propto x^{\alpha+1-1} / \{ 1 - (1 - x) \} = x^{\alpha - 1}, $$

that is, the beta distribution $\mathcal{B}e(\alpha, 1)$. (Note that a Metropolis-Hastings algorithm based on the same pair $(f, g)$ would lead to accept $y$ with probability $x^{(t)}/y$, which is larger than $x^{(t)}$ but requires a simulation of $y$ before deciding on its acceptance.)

This scheme has interesting theoretical properties in that it produces a Markov chain that does not satisfy the assumptions of the Central Limit Theorem. (Establishing this is beyond our scope. See, for example, Doukhan et al. 1994.) A consequence of this is the “slow” convergence of empirical averages, since they are not controlled by the CLT. For instance, for $h(x) = x^{1-\alpha}$, Figure 12.13 describes the evolution of the empirical average of the $h(x^{(t)})$'s to $\alpha$ (see Problem 12.4), a convergence that is guaranteed by the ergodic theorem (Proposition 6.63). As is obvious from the curve, convergence is not achieved after $5 \times 10^6$ iterations. Figure 12.13 also provides a comparison with a realization of the Metropolis-Hastings algorithm based on the same pair. Both graphs correspond to identical beta generations of $y_t$ in (12.11), with a higher probability of acceptance for the Metropolis-Hastings algorithm. The graph associated with the Metropolis-Hastings algorithm, while more regular than the graph for the transition (12.11), is also very slow to converge. The final value is still closer to the exact expectation $\alpha = 0.20$.

Fig. 12.13. Convergence of the empirical average of $h(x^{(t)})$ to $\alpha = 0.2$ for the algorithm (12.11) (solid lines) and the Metropolis-Hastings algorithm (dotted lines).

A simple explanation (due to Jim Berger) of the bad behavior of both algorithms is the considerable difference between the distributions $\mathcal{B}e(0.2, 1)$ and $\mathcal{B}e(1.2, 1)$. In fact, the ratio of the corresponding cdf's is $x^{0.2}/x^{1.2} = 1/x$ and, although the quantile at level 10% is $10^{-5}$ for $\mathcal{B}e(0.2, 1)$, the probability of reaching the interval $[0, 10^{-5}]$ under the $\mathcal{B}e(1.2, 1)$ distribution is $10^{-6}$. Therefore, the number of simulations from $\mathcal{B}e(1.2, 1)$ necessary to obtain an adequate coverage of this part of the support of $\mathcal{B}e(0.2, 1)$ is enormous. Moreover, note that the probability of leaving the interval $[0, 10^{-5}]$ using [A.30] is less than $10^{-5}$, and it is of the same order of magnitude for the corresponding Metropolis-Hastings algorithm. ||
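The transition (12.11) is easy to simulate, which makes the slow convergence simple to observe. The sketch below is only an illustration: the function names, the starting value, the seed, and the (short) run length are assumptions, and a run of this size is far too small to reproduce the scale of Figure 12.13.

```python
import numpy as np


def pathological_beta_chain(n_iter, alpha, x0, rng):
    """Simulate (12.11): move to Y ~ Be(alpha + 1, 1) with probability x, else stay at x."""
    chain = np.empty(n_iter)
    x = x0
    for t in range(n_iter):
        if rng.uniform() < x:
            x = rng.beta(alpha + 1.0, 1.0)
        chain[t] = x
    return chain


def running_average(values):
    """Empirical averages of the first 1, 2, ..., T values."""
    return np.cumsum(values) / np.arange(1, len(values) + 1)


rng = np.random.default_rng(1)
alpha = 0.2
chain = pathological_beta_chain(20000, alpha, x0=0.5, rng=rng)
averages = running_average(chain ** (1.0 - alpha))  # should approach alpha, very slowly
```

The long plateaus in `chain` correspond to visits near 0, where the probability of moving is itself close to 0; these are the excursions that slow down the convergence of `averages` to $\alpha$.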

Thus, as in the first case, graphical outputs can detect obvious problems of convergence of the empirical average. To proceed further in this evaluation, Yu (1994) and Yu and Mykland (1998) propose to use cumulative sums (CUSUM), graphing the partial differences

$$ D_T^i = \sum_{t=1}^{i} \left[ h(\theta^{(t)}) - S_T \right], \qquad i = 1, \ldots, T, \tag{12.12} $$

where

$$ S_T = \frac{1}{T} \sum_{t=1}^{T} h(\theta^{(t)}) \tag{12.13} $$

is the final average. These authors derive a qualitative evaluation of the mixing speed of the chain and the correlation between the $\theta^{(t)}$'s. When the mixing of the chain is high, the graph of $D_T^i$ is highly irregular and concentrated around 0. Slowly mixing chains (that is, chains with a slow pace of exploration of the stationary distribution) produce regular graphs with long excursions away from 0. Figure 12.14 contains the graph which corresponds to the dataset of Example 12.10, exhibiting a slow convergence behavior already indicated in Figure 12.13. (See Brooks 1998c for a more quantitative approach to CUSUMs.)

Fig. 12.14. Evolution of the $D_T^i$ criterion for the chain produced in Example 12.10 and for $\alpha = 0.2$ and $h(x) = x^{0.8}$ (first axis: millions of iterations).
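The CUSUM path is a centered cumulative sum, so it takes one line to compute. The sketch below assumes the values $h(\theta^{(t)})$ are already stored in an array; the comparison between an iid sequence and a strongly autocorrelated AR(1) sequence is an artificial illustration, not an example from the text.

```python
import numpy as np


def cusum(h_values):
    """CUSUM path D_T^i = sum_{t <= i} [h(theta^(t)) - S_T], with S_T the final average."""
    h_values = np.asarray(h_values, dtype=float)
    return np.cumsum(h_values - h_values.mean())


rng = np.random.default_rng(2)
n = 5000
iid = rng.standard_normal(n)             # fast-mixing reference sequence

ar = np.empty(n)                         # slowly mixing AR(1) sequence (illustrative)
ar[0] = 0.0
for t in range(1, n):
    ar[t] = 0.99 * ar[t - 1] + rng.standard_normal()

d_iid = cusum(iid)   # irregular path hovering around 0
d_ar = cusum(ar)     # smooth path with long excursions away from 0
```

By construction the path returns to 0 at $i = T$, so only its shape between the endpoints is informative.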

Figure 12.15 provides the CUSUM analysis of the MCMC sample of Ex- 
ample 12.1 for the parameters and Even more than in the 

raw-plots, the two first parameters do not exhibit a strong stationarity, since 
the CUSUM plots are very regular and spend the entire sequence above 0. We 
can also note the strong similarity between the CUSUM plots of and 
due to the fact that only (3^^^ is identifiable from the model. On the other 
hand, and as expected, the CUSUM plot for is more in agreement 

with a Brownian motion sequence. 

Similarly, for the case of the mixture over the means of Example 9.2, Figure 12.16 exhibits a situation when the Gibbs sampler cannot escape the attraction of the local (and much lower) mode. The corresponding CUSUMs for both means $\mu_1$ and $\mu_2$ in Figure 12.17 are nonetheless close enough to the gold standard of the Brownian motion. This difficulty is common to most “on-line” methods, that is, to diagnoses based on a single chain. It is almost impossible to detect the existence of other modes of $f$ (or of other unexplored regions of the space with positive probability). (Exceptions are the Riemann sum indicator (12.9), which “estimates” 1 and can detect probability losses in the region covered by the sample, and the related method developed in Brooks 1998a.)

Fig. 12.15. Evolution of the CUSUM criterion $D_T^i$ for the three chains (top, middle, and bottom) of Example 12.1.

Fig. 12.16. Gibbs sample for the two mean mixture model, when initialized close to the second and lower mode, for true values $\mu_1 = 0$ and $\mu_2 = 2.5$, over the log-likelihood surface.

12.3.2 Multiple Estimates 

In most cases, the graph of the raw sequence $(\theta^{(t)})$ is unhelpful in the detection of stationarity or convergence. Indeed, it is only when the chain has explored different regions of the state-space during the observation time that lack of stationarity can be detected. (Gelman and Rubin 1992 also illustrate the use of the raw sequence graphs when using chains in parallel with quite distinct supports.)

Given some quantities of interest $\mathbb{E}_f[h(\theta)]$, a more helpful indicator is the behavior of the averages (12.13) in terms of $T$. A necessary condition for convergence is the stationarity of the sequence $(S_T)$, even though the stabilizing of this sequence may correspond to the influence of a single mode of the density $f$, as shown by Gelman and Rubin (1992).

Fig. 12.17. CUSUMs for both means $\mu_1$ and $\mu_2$ corresponding to the sample of Figure 12.16.

Robert (1995a) proposes a more robust approach to graphical convergence monitoring, related to control variates (Section 4.4.2). The idea is to simultaneously use several convergent estimators of $\mathbb{E}_f[h(\theta)]$ based on the same chain $(\theta^{(t)})$, until all estimators coincide (up to a given precision). The most common estimation techniques in this setup are, besides the empirical average $S_T$, the conditional (or Rao-Blackwellized) version of this average, either in its nonparametric version (Section 7.6.2) for Metropolis-Hastings algorithms, or in its parametric version (Section 9.3) for Gibbs sampling,

$$ S_T^C = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ h(\theta) \mid \eta^{(t)} \right], $$

where $(\theta^{(t)}, \eta^{(t)})$ is the Markov chain produced by the algorithm.

A second technique providing convergent estimators is importance sampling (Section 3.3). If the density $f$ is available up to a constant,¹ the importance sampling alternative is

$$ S_T^P = \sum_{t=1}^{T} w_t\, h(\theta^{(t)}), $$

where $w_t \propto f(\theta^{(t)})/g_t(\theta^{(t)})$ and $g_t$ is the true density used for the simulation of $\theta^{(t)}$. In particular, in the case of Gibbs sampling,

$$ g_t(\theta^{(t)}) \propto f_1(\theta_1^{(t)} \mid \theta_2^{(t-1)}, \ldots, \theta_k^{(t-1)}) \cdots f_k(\theta_k^{(t)} \mid \theta_1^{(t)}, \ldots, \theta_{k-1}^{(t)}). $$

¹ As it is, for instance, in all cases where the Gibbs sampler applies, as shown by the Hammersley-Clifford theorem (Section 10.1.3).












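If, on the other hand, the chain $(\theta^{(t)})$ is produced by a Metropolis-Hastings algorithm, the variables actually simulated, $\eta^{(t)} \sim q(\eta \mid \theta^{(t-1)})$, can be recycled through the estimator

$$ S_T^{MP} = \sum_{t=1}^{T} w_t\, h(\eta^{(t)}) $$

with $w_t \propto f(\eta^{(t)})/q(\eta^{(t)} \mid \theta^{(t-1)})$.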

The variance decomposition

$$ \mathrm{var}\left( \sum_{t=1}^{T} w_t X_t \right) = \sum_{t=1}^{T} \mathrm{var}(w_t X_t) \tag{12.14} $$

does not directly apply to the estimators $S_T^P$ and $S_T^{MP}$ since the weights $w_t$ are known only up to a multiplicative constant and are, therefore, normalized by the inverse of their sum. However, it can be assumed that the effect of this normalization on the correlation vanishes when $T$ is large (Problem 12.14).

More importantly, importance sampling removes the correlation between the terms in the sum. In fact, whether or not the $\theta^{(t)}$'s are correlated, the importance-weighted terms will always be uncorrelated. So, for importance sampling estimators, the variance of the sum will equal the sum of the variances of the individual terms.



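Lemma 12.11. Let $(Z^{(t)})$ be a Markov chain with transition kernel $q$. Then

$$ \mathrm{var}\left( \sum_{t=1}^{T} h(Z^{(t)})\, \frac{f(Z^{(t)})}{q(Z^{(t)} \mid Z^{(t-1)})} \right) = \sum_{t=1}^{T} \mathrm{var}\left( h(Z^{(t)})\, \frac{f(Z^{(t)})}{q(Z^{(t)} \mid Z^{(t-1)})} \right), $$

provided these quantities are well defined.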

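Proof. Assume, without loss of generality, that $\mathbb{E}_f[h(Z)] = 0$. If we define $w_t = f(Z^{(t)})/q(Z^{(t)} \mid Z^{(t-1)})$, the covariance between $w_t h(Z^{(t)})$ and $w_{t+\tau} h(Z^{(t+\tau)})$ is

$$ \mathbb{E}\left[ w_t h(Z^{(t)})\, w_{t+\tau} h(Z^{(t+\tau)}) \right] = \mathbb{E}\left[ \mathbb{E}\left[ w_t h(Z^{(t)}) \mid Z^{(t+\tau-1)} \right] \mathbb{E}\left[ w_{t+\tau} h(Z^{(t+\tau)}) \mid Z^{(t+\tau-1)} \right] \right], $$

where we have iterated the expectation and used the Markov property of conditional independence. The second conditional expectation is

$$ \mathbb{E}\left[ w_{t+\tau} h(Z^{(t+\tau)}) \mid Z^{(t+\tau-1)} \right] = \int h(x)\, \frac{f(x)}{q(x \mid Z^{(t+\tau-1)})}\, q(x \mid Z^{(t+\tau-1)})\, dx = \mathbb{E}_f h(X) = 0, $$

showing that the covariance is zero. □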

The consequences of (12.14) on convergence assessment are twofold. First, they indicate that, up to second order, $S_T^P$ and $S_T^{MP}$ behave as in the independent case, and thus allow for a more traditional convergence control on these quantities. Second, (12.14) implies that the variance of $S_T^P$ (or of $S_T^{MP}$), when it exists, decreases at speed $1/T$ in stationarity settings. Thus, non-stationarity can be detected if the decrease of the variations of $S_T^P$ does not fit in a confidence parabola of order $1/\sqrt{T}$. Note also that the density $f$ can sometimes be replaced (in $w_t$) by an approximation. In particular, in settings where Rao-Blackwellization applies,

$$ \hat f_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} f_1(\theta \mid \eta^{(t)}) \tag{12.15} $$

provides an unbiased estimator of $f$ (see Wei and Tanner 1990a and Tanner 1996). A parallel chain $(\eta^{(t)})$ should then be used to ensure the independence of $\hat f_T$ and of $(\theta^{(t)})$.

The fourth estimator based on $(\theta^{(t)})$ is the Riemann approximation (Section 4.3); that is,

$$ S_T^R = \sum_{t=1}^{T-1} \left[ \theta_{(t+1)} - \theta_{(t)} \right] h(\theta_{(t)})\, f(\theta_{(t)}), \tag{12.16} $$

which estimates $\mathbb{E}_f[h(\theta)]$, where $\theta_{(1)} \le \cdots \le \theta_{(T)}$ denotes the ordered chain $(\theta^{(t)})_{1 \le t \le T}$. This estimator is mainly studied in the iid case (see Proposition 4.9), but it can, nonetheless, be included as an alternative estimator in the present setup, since its performance tends to be superior to that of the previous estimators.
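A possible implementation of the Riemann estimator (12.16) is sketched below. The function name and the `density_is_normalized` flag are assumptions; when the density is only known up to a constant, the sketch divides by the corresponding Riemann sum of the density, which is the ratio form used in the examples of this section. The uniform draws and the normal target are artificial choices for the check.

```python
import numpy as np


def riemann_estimate(chain, h, density, density_is_normalized=True):
    """Riemann approximation (12.16) of E_f[h(theta)] from a one-dimensional sample."""
    theta = np.sort(np.asarray(chain, dtype=float))
    gaps = np.diff(theta)              # theta_(t+1) - theta_(t)
    left = theta[:-1]                  # theta_(t)
    numerator = np.sum(gaps * h(left) * density(left))
    if density_is_normalized:
        return float(numerator)
    # Unnormalized density: divide by the Riemann sum of the density itself.
    return float(numerator / np.sum(gaps * density(left)))


def std_normal(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)


def unnormalized_normal(x):
    return np.exp(-0.5 * x ** 2)


def square(x):
    return x ** 2


rng = np.random.default_rng(3)
draws = rng.uniform(-6.0, 6.0, size=20000)    # need not come from the target
est = riemann_estimate(draws, square, std_normal)                            # E[X^2] = 1
est_ratio = riemann_estimate(draws, square, unnormalized_normal, False)      # same target
```

Both estimates should be close to 1 here, the second moment of the standard normal distribution.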

The main drawback of $S_T^R$ lies in the unidimensionality requirement, the quality of multidimensional extensions of the Riemann approximation decreasing quickly with the dimension (see Yakowitz et al. 1978). When $h$ involves only one component of $\theta^{(t)}$, the marginal distribution of this component should therefore be used or replaced with an approximation like (12.15), which increases the computation burden, and is not always available. For extensions and alternatives, see Robert (1995a) and Philippe and Robert (2001).

Example 12.12. Cauchy posterior distribution. For the posterior distribution of Example 7.18,

$$ \pi(\theta \mid x_1, x_2, x_3) \propto e^{-\theta^2/(2\sigma^2)} \left[ (1 + (\theta - x_1)^2)(1 + (\theta - x_2)^2)(1 + (\theta - x_3)^2) \right]^{-1}, \tag{12.17} $$

a Gibbs sampling algorithm which approximates this distribution can be derived by demarginalization, namely by introducing three artificial variables, $\eta_1$, $\eta_2$, and $\eta_3$, such that

$$ \pi(\theta, \eta_1, \eta_2, \eta_3 \mid x_1, x_2, x_3) \propto e^{-\theta^2/(2\sigma^2)}\, e^{-(1+(\theta-x_1)^2)\eta_1/2}\, e^{-(1+(\theta-x_2)^2)\eta_2/2}\, e^{-(1+(\theta-x_3)^2)\eta_3/2}. $$

In fact, similar to the $t$ distribution (see Example 10.1), expression (12.17) appears as the marginal distribution of $\pi(\theta, \eta_1, \eta_2, \eta_3 \mid x_1, x_2, x_3)$ and the conditional distributions of the Gibbs sampler ($i = 1, 2, 3$)

$$ \begin{aligned}
\eta_i \mid \theta, x_i &\sim \mathcal{E}xp\left( \frac{1 + (\theta - x_i)^2}{2} \right), \\
\theta \mid x_1, x_2, x_3, \eta_1, \eta_2, \eta_3 &\sim \mathcal{N}\left( \frac{\eta_1 x_1 + \eta_2 x_2 + \eta_3 x_3}{\eta_1 + \eta_2 + \eta_3 + \sigma^{-2}}, \frac{1}{\eta_1 + \eta_2 + \eta_3 + \sigma^{-2}} \right),
\end{aligned} $$

are easy enough to simulate. Figure 12.18 illustrates the efficiency of this algorithm by exhibiting the agreement between the histogram of the simulated $\theta^{(t)}$'s and the true posterior distribution (12.17).

Fig. 12.18. Comparison of the density (12.17) and the histogram from a sample of 20,000 points simulated by Gibbs sampling, for $x_1 = -8$, $x_2 = 8$, $x_3 = 17$, and $\sigma^2 = 50$.
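The two conditional distributions above translate directly into a Gibbs sampler. The sketch below is a minimal version under stated assumptions: the function name, the starting value, the seed, and the run length are arbitrary, and the data and prior variance are the values quoted in the caption of Figure 12.18. It also computes the empirical and conditional (Rao-Blackwellized) averages of $h(\theta) = \exp(-\theta/\sigma)$, using the normal moment generating function for the latter.

```python
import numpy as np


def cauchy_posterior_gibbs(x, sigma2, n_iter, rng, theta0=0.0):
    """Gibbs sampler for (12.17) with auxiliary exponential variables eta_1, eta_2, eta_3."""
    x = np.asarray(x, dtype=float)
    theta = theta0
    thetas = np.empty(n_iter)
    etas = np.empty((n_iter, x.size))
    for t in range(n_iter):
        rate = (1.0 + (theta - x) ** 2) / 2.0
        eta = rng.exponential(scale=1.0 / rate)      # NumPy uses scale = 1 / rate
        precision = eta.sum() + 1.0 / sigma2
        mean = np.dot(eta, x) / precision
        theta = rng.normal(mean, np.sqrt(1.0 / precision))
        thetas[t], etas[t] = theta, eta
    return thetas, etas


rng = np.random.default_rng(4)
x_obs = np.array([-8.0, 8.0, 17.0])
sigma2 = 50.0
sigma = np.sqrt(sigma2)
thetas, etas = cauchy_posterior_gibbs(x_obs, sigma2, n_iter=2000, rng=rng)

# Empirical average of h(theta) = exp(-theta / sigma).
s_emp = float(np.mean(np.exp(-thetas / sigma)))

# Conditional version: E[exp(-theta/sigma) | eta] for theta | eta ~ N(mu, tau^2).
precision = etas.sum(axis=1) + 1.0 / sigma2
mu = etas @ x_obs / precision
tau2 = 1.0 / precision
s_cond = float(np.mean(np.exp(-mu / sigma + tau2 / (2.0 * sigma2))))
```

Storing the auxiliary variables alongside the chain is what makes the conditional estimator available at no extra simulation cost.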

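For simplicity’s sake, we introduce

$$ \mu(\eta_1, \eta_2, \eta_3) = \frac{\eta_1 x_1 + \eta_2 x_2 + \eta_3 x_3}{\eta_1 + \eta_2 + \eta_3 + \sigma^{-2}} $$

and

$$ \tau^{-2}(\eta_1, \eta_2, \eta_3) = \eta_1 + \eta_2 + \eta_3 + \sigma^{-2}. $$

If the function of interest is $h(\theta) = \exp(-\theta/\sigma)$ (with known $\sigma$), the different approximations of $\mathbb{E}_\pi[h(\theta)]$ proposed above can be derived. The conditional version of the empirical mean is

$$ S_T^C = \frac{1}{T} \sum_{t=1}^{T} \exp\left\{ -\mu(\eta_1^{(t)}, \eta_2^{(t)}, \eta_3^{(t)})/\sigma + \tau^2(\eta_1^{(t)}, \eta_2^{(t)}, \eta_3^{(t)})/(2\sigma^2) \right\}, $$

importance sampling is associated with the weights

$$ w_t \propto \tau(\eta_1^{(t)}, \eta_2^{(t)}, \eta_3^{(t)})\, \frac{\exp\left\{ (\theta^{(t)} - \mu(\eta_1^{(t)}, \eta_2^{(t)}, \eta_3^{(t)}))^2 / (2\tau^2(\eta_1^{(t)}, \eta_2^{(t)}, \eta_3^{(t)})) - (\theta^{(t)})^2/(2\sigma^2) \right\}}{\prod_{i=1}^{3} \left[ 1 + (x_i - \theta^{(t)})^2 \right]}, $$

and the Riemann approximation is

$$ S_T^R = \frac{\sum_{t=1}^{T-1} (\theta_{(t+1)} - \theta_{(t)})\, e^{-\theta_{(t)}/\sigma - \theta_{(t)}^2/(2\sigma^2)} \prod_{i=1}^{3} \left[ 1 + (x_i - \theta_{(t)})^2 \right]^{-1}}{\sum_{t=1}^{T-1} (\theta_{(t+1)} - \theta_{(t)})\, e^{-\theta_{(t)}^2/(2\sigma^2)} \prod_{i=1}^{3} \left[ 1 + (x_i - \theta_{(t)})^2 \right]^{-1}}, $$

where $\theta_{(1)} \le \cdots \le \theta_{(T)}$ denotes the ordered sample of the $\theta^{(t)}$'s.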

Figure 12.19 graphs the convergence of the four estimators versus $T$. As is typical, $S_T$ and $S_T^C$ are similar almost from the start (and, therefore, their difference cannot be used to diagnose [lack of] convergence). The strong stability of $S_T^R$, which is the central value of $S_T$ and $S_T^C$, indicates that convergence is achieved after a small number of iterations. On the other hand, the behavior of $S_T^P$ indicates that importance sampling does not perform satisfactorily in this case and is most likely associated with an infinite variance. (The ratio $f(\theta)/f_1(\theta|\eta)$ corresponding to the Gibbs sampling usually leads to an infinite variance (see Section 3.3.2), since $f$ is the marginal of $f_1(\theta|\eta)$ and has generally fatter tails than $f_1$. For example, this is the case when $f$ is a $t$ density and $f_1(\cdot|\eta)$ is a normal density with variance $\eta$.) ||



Fig. 12.19. Convergence of four estimators $S_T$ (solid line), $S_T^C$ (dotted line), $S_T^P$ (mixed) and $S_T^R$ (long dashes, below) of the expectation under (12.17) of $h(\theta) = \exp(-\theta/\sigma)$, for $\sigma^2 = 50$ and $(x_1, x_2, x_3) = (-8, 8, 17)$. The graphs of $S_T$ and $S_T^C$ are identical. The final values are 0.845, 0.844, 0.828, and 0.845 for $S_T$, $S_T^C$, $S_T^P$, and $S_T^R$ respectively.

Example 12.13. (Continuation of Example 12.10) The chain $(X^{(t)})$ produced by (12.11) allows for a conditional version when $h(x) = x^{1-\alpha}$ since

$$ \mathbb{E}\left[ (X^{(t+1)})^{1-\alpha} \mid x^{(t)} \right] = (1 - x^{(t)})\, (x^{(t)})^{1-\alpha} + x^{(t)}\, \mathbb{E}[Y^{1-\alpha}] = (1 - x^{(t)})\, (x^{(t)})^{1-\alpha} + x^{(t)}\, \frac{\alpha + 1}{2}. $$

The importance sampling estimator based on the simulated $y^{(t)}$'s is

$$ S_T^P = \sum_{t=1}^{T} (y^{(t)})^{-\alpha} \Big/ \sum_{t=1}^{T} (y^{(t)})^{-1} $$

and the Riemann approximation based on the same $y^{(t)}$'s is

$$ S_T^R = \sum_{t=1}^{T-1} (y_{(t+1)} - y_{(t)}) \Big/ \sum_{t=1}^{T-1} (y_{(t+1)} - y_{(t)})\, y_{(t)}^{\alpha-1} = \frac{y_{(T)} - y_{(1)}}{\sum_{t=1}^{T-1} (y_{(t+1)} - y_{(t)})\, y_{(t)}^{\alpha-1}}. $$

Figure 12.20 gives an evaluation of these different estimators, showing the incredibly slow convergence of (12.1), since this experiment uses 1 million iterations. Note that the importance sampling estimator $S_T^P$ approaches the true value $\alpha = 0.2$ faster than $S_T$ and $S_T^R$ ($S_T^C$ being again indistinguishable from $S_T$), although the repeated jumps in the graph of $S_T^P$ do not indicate convergence. Besides, eliminating the first 200,000 iterations of the chain does not stabilize convergence, which shows that the very slow mixing of the algorithm is responsible for this phenomenon rather than a lack of convergence to stationarity. ||

Fig. 12.20. Convergence of four estimators $S_T$ (solid line), $S_T^C$ (dotted line), $S_T^P$ (mixed dashes) and $S_T^R$ (long dashes, below) of $\mathbb{E}[(X^{(t)})^{0.8}]$ for the $\mathcal{B}e(0.2, 1)$ distribution, after elimination of the first 200,000 iterations. The graphs of $S_T$ and $S_T^C$ are identical. The final values are 0.224, 0.224, 0.211, and 0.223 for $S_T$, $S_T^C$, $S_T^P$, and $S_T^R$ respectively, to compare with a theoretical value of 0.2.

Both examples highlight the limitations of the method of multiple estimates:



(1) The method does not always apply (parametric Rao-Blackwellization and 
Riemann approximation are not always available). 

(2) It is intrinsically conservative (since the speed of convergence is deter- 
mined by the slower estimate). 

(3) It cannot detect missing modes, except when the Riemann approximation 
can be implemented. 

(4) It is often the case that Rao-Blackwellization cannot be distinguished from 
the standard average and that importance sampling leads to an infinite 
variance estimator. 

12.3.3 Renewal Theory 

As mentioned in Section 12.2.3, it is possible to implement small sets as a theoretical control, although their applicability is more restricted than for other methods. Recall that Theorem 6.21, which was established by Asmussen (1979), guarantees the existence of small sets for ergodic Markov chains. Assuming $(\theta^{(t)})$ is strongly aperiodic, there exist a set $C$, a positive number $\varepsilon < 1$, and a probability measure $\nu$ such that

$$ P(\theta^{(t+1)} \in B \mid \theta^{(t)}) \ge \varepsilon\, \nu(B) \quad \text{for every } \theta^{(t)} \in C. \tag{12.18} $$

Recall also that the augmentation technique of Athreya and Ney (1978) consists in a modification of the transition kernel $K$ into a kernel associated with the transition

$$ \theta^{(t+1)} \sim \begin{cases}
\nu & \text{with probability } \varepsilon, \\[4pt]
\dfrac{K(\theta^{(t)}, \cdot) - \varepsilon\, \nu(\cdot)}{1 - \varepsilon} & \text{with probability } (1 - \varepsilon),
\end{cases} $$

when $\theta^{(t)} \in C$. Thus, every passage in $C$ results in an independent generation from $\nu$ with probability $\varepsilon$. Denote by $\tau_C(k)$ the time of the $(k+1)$st visit to $C$ associated with an independent generation from $\nu$; $(\tau_C(k))$ is then a sequence of renewal times. Therefore, the partial sums

$$ S_k = \sum_{t = \tau_C(k)+1}^{\tau_C(k+1)} h(\theta^{(t)}) $$

are iid in the stationary regime. Under conditions (6.31) in Proposition 6.64, it follows that the $S_k$'s satisfy a version of the Central Limit Theorem,

$$ \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \left( S_k - \lambda_k\, \mathbb{E}_f[h(\theta)] \right) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \sigma_C^2), $$

with $K$ the number of renewals out of $T$ iterations and $\mu_C = \mathbb{E}_f[\tau_C(2) - \tau_C(1)]$. Since these variables are iid, it is even possible to estimate $\sigma_C^2$ by the usual estimator

(12.19)

where $\lambda_k = \tau_C(k+1) - \tau_C(k)$.

These probabilistic results, although elementary, have nonetheless a major consequence on the monitoring of MCMC algorithms since they lead to an estimate of the limit variance $\gamma_h^2$ of the Central Limit Theorem (Proposition 6.64), as well as a stopping rule for these algorithms. This result thus provides an additional criterion for convergence assessment, since, for different small sets $C$, the ratios $K_C \hat\sigma_C^2 / T$ must converge to the same value. Again, this criterion is not foolproof since it requires sufficiently dispersed small sets.

Proposition 12.14. When the Central Limit Theorem applies to the Markov chain under study, the variance $\gamma_h^2$ of the limiting distribution associated with the sum (12.1) can be estimated by

$$ K_C\, \hat\sigma_C^2 / T, $$

where $K_C$ is the number of renewal events before time $T$ and $\hat\sigma_C^2$ is given by (12.19).

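Proof. The Central Limit Theorem result

$$ \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \left( h(\theta^{(t)}) - \mathbb{E}_f[h(\theta)] \right) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \gamma_h^2) $$

can be written

$$ \frac{1}{\sqrt{T}} \sum_{t=1}^{\tau_C(1)} \left( h(\theta^{(t)}) - \mathbb{E}_f[h(\theta)] \right) + \frac{1}{\sqrt{T}} \sum_{k=1}^{K_C} \left( S_k - \lambda_k\, \mathbb{E}_f[h(\theta)] \right) + \frac{1}{\sqrt{T}} \sum_{t = \tau_C(K_C)+1}^{T} \left( h(\theta^{(t)}) - \mathbb{E}_f[h(\theta)] \right). \tag{12.20} $$

Under the conditions

$$ \mathbb{E}_f[\lambda_1^2] < \infty \quad \text{and} \quad \mathbb{E}_f\left[ \left( \sum_{t=1}^{\tau_C(1)} h(\theta^{(t)}) \right)^2 \right] < \infty, $$

which are, in fact, the sufficient conditions (6.31) for the Central Limit Theorem (see Proposition 6.64), the first and the third terms of (12.20) almost surely converge to 0. Since

$$ \frac{1}{\sqrt{K_C}} \sum_{k=1}^{K_C} \left( S_k - \lambda_k\, \mathbb{E}_f[h(\theta)] \right) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \sigma_C^2) $$

and $T/K_C$ almost surely converges to $\mu_C$, the average number of excursions between two passages in $C$, it follows that

$$ \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \left( h(\theta^{(t)}) - \mathbb{E}_f[h(\theta)] \right) \xrightarrow{\mathcal{L}} \mathcal{N}\left( 0, \frac{\sigma_C^2}{\mu_C} \right), $$

which concludes the proof. □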

The fundamental factor in this monitoring method is the triplet $(C, \varepsilon, \nu)$ which must be determined for every application and which (strongly) depends on the problem at hand. Moreover, if the renewal probability $\varepsilon\pi(C)$ is too small, this approach is useless as a practical control device. (See Section 6.7.2.3 and Robert et al. 2002 for further details.)

Latent variable and missing data setups often lead to this kind of difficulty as $\varepsilon$ decreases as a power of the number of observations. The same problem often occurs in high-dimensional problems (see Gilks et al. 1998). Mykland et al. (1995) eliminate this difficulty by modifying the algorithm, as detailed in Section 12.3.3. If we stick to the original version of the algorithm, the practical construction of $(C, \varepsilon, \nu)$ seems to imply a detailed study of the chain $(\theta^{(t)})$ and of its transition kernel. Examples 10.1 and 10.17 in Section 10.2.1 have, however, shown that this study can be conducted for some Gibbs samplers and we see below how a more generic version can be constructed for an important class of problems.

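Example 12.15. (Continuation of Example 12.12) The form of the posterior distribution (12.17) suggests using a small set $C$ equal to $[r_1, r_2]$ with $x_2 \in [r_1, r_2]$, $x_1 < r_1$, and $x_3 > r_2$ (if $x_1 < x_2 < x_3$). This choice leads to the following bounds:

$$ \begin{aligned}
\rho_{11} = r_1 - x_1 &< |\theta - x_1| < \rho_{12} = r_2 - x_1, \\
0 &< |\theta - x_2| < \rho_{22} = \max(r_2 - x_2, x_2 - r_1), \\
\rho_{31} = x_3 - r_2 &< |\theta - x_3| < \rho_{32} = x_3 - r_1,
\end{aligned} $$

for $\theta \in C$, which give $\varepsilon$ and $\nu$. In fact,

$$ \begin{aligned}
K(\theta, \theta') \ge{} & \int \frac{1}{\sqrt{2\pi}}\, \tau(\eta_1, \eta_2, \eta_3)^{-1} \exp\left\{ -(\theta' - \mu(\eta_1, \eta_2, \eta_3))^2 / (2\tau^2(\eta_1, \eta_2, \eta_3)) \right\} \\
& \times \frac{1 + \rho_{11}^2}{2} \exp\{ -(1 + \rho_{12}^2)\eta_1/2 \}\; \frac{1}{2} \exp\{ -(1 + \rho_{22}^2)\eta_2/2 \} \\
& \times \frac{1 + \rho_{31}^2}{2} \exp\{ -(1 + \rho_{32}^2)\eta_3/2 \}\, d\eta_1\, d\eta_2\, d\eta_3 \\
={} & \frac{1 + \rho_{11}^2}{1 + \rho_{12}^2}\; \frac{1}{1 + \rho_{22}^2}\; \frac{1 + \rho_{31}^2}{1 + \rho_{32}^2}\; \nu(\theta') = \varepsilon\, \nu(\theta'),
\end{aligned} $$

where $\nu$ is the marginal density of $\theta$ for

$$ (\theta, \eta_1, \eta_2, \eta_3) \sim \mathcal{N}\left( \mu(\eta_1, \eta_2, \eta_3), \tau^2(\eta_1, \eta_2, \eta_3) \right) \times \mathcal{E}xp\left( \frac{1 + \rho_{12}^2}{2} \right) \times \mathcal{E}xp\left( \frac{1 + \rho_{22}^2}{2} \right) \times \mathcal{E}xp\left( \frac{1 + \rho_{32}^2}{2} \right). $$

One can thus evaluate the frequency $1/\mu_C$ of visits to $C$ and calibrate $r_1$ and $r_2$ to obtain optimal renewal probabilities. ||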

Another difficulty with this method is that the splitting technique requires
the generation of variables distributed as

(12.21)    $$R(\theta^{(t)}, \xi) = \frac{K(\theta^{(t)}, \xi) - \varepsilon\nu(\xi)}{1 - \varepsilon},$$

called the residual kernel. It is actually enough to know the ratio
$K(\theta^{(t)}, \xi)/\varepsilon\nu(\xi)$ since, following from Lemma 2.27, the algorithm

Algorithm A.53 -Negative Weight Mixture Simulation-

1. Simulate $\xi \sim K(\theta^{(t)}, \xi)$.

2. Accept $\xi$ with probability $1 - \dfrac{\varepsilon\nu(\xi)}{K(\theta^{(t)}, \xi)}$,
   otherwise go to 1.                                                    [A.53]

provides simulations from the distribution (12.21). Since the densities $K(\theta, \xi)$
and $\nu$ are in general unknown, it is necessary to use approximations, as in
Section 12.2.5.
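As a rough illustration of the two steps of [A.53], the sketch below uses a finite state space, where the kernel is a transition matrix and everything is available in closed form. The 3-state matrix `P`, the choice of the whole state space as small set (so that $\varepsilon\nu(j) = \min_i p_{ij}$), and the names `eps_nu` and `residual_draw` are assumptions made for this example only; they are not taken from the text.

```python
import numpy as np

# Hypothetical 3-state transition matrix, used only for illustration.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])

# Assumed minorization over the whole state space: eps * nu(j) = min_i P[i, j].
eps_nu = P.min(axis=0)
eps = eps_nu.sum()


def residual_draw(P, i, eps_nu, rng, max_tries=10_000):
    """Draw from R(i, .) = (P[i, .] - eps_nu) / (1 - eps) by rejection."""
    for _ in range(max_tries):
        xi = rng.choice(P.shape[0], p=P[i])              # step 1: xi ~ K(i, .)
        if rng.random() < 1.0 - eps_nu[xi] / P[i, xi]:   # step 2: accept/reject
            return int(xi)
    raise RuntimeError("no proposal accepted; is eps equal to 1?")


rng = np.random.default_rng(0)
draws = np.array([residual_draw(P, 0, eps_nu, rng) for _ in range(5000)])
freq = np.bincount(draws, minlength=3) / draws.size   # empirical frequencies
target = (P[0] - eps_nu) / (1.0 - eps)                # exact residual kernel row
```

In this toy case the empirical frequencies `freq` can be compared directly with the exact residual row `target`; in a continuous setting the ratio in step 2 would have to be approximated, as the text notes.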

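Example 12.16. (Continuation of Example 12.15) The integral forms
of $K(\theta, \xi)$ and of $\nu(\xi)$ (that is, their representation as marginals of other
distributions) allow for a Rao-Blackwellized approximation. Indeed,

$$\frac{1}{M} \sum_{i=1}^{M} \varphi(\xi \mid \eta_1^i, \eta_2^i, \eta_3^i)$$

and

$$\frac{1}{M} \sum_{i=1}^{M} \varphi(\xi \mid \tilde\eta_1^i, \tilde\eta_2^i, \tilde\eta_3^i),$$

with

$$\varphi(\xi \mid \eta_1, \eta_2, \eta_3) = \exp\left\{-(\xi - \mu(\eta_1,\eta_2,\eta_3))^2 / (2\tau^2(\eta_1,\eta_2,\eta_3))\right\}\, \tau^{-1}(\eta_1,\eta_2,\eta_3)$$

and

$$\eta_j^i \sim \mathcal{E}xp\left(\frac{1 + (\theta - x_j)^2}{2}\right), \qquad
\tilde\eta_j^i \sim \mathcal{E}xp\left(\frac{1 + \rho_{j2}^2}{2}\right) \qquad (j = 1, 2, 3),$$

provide convergent (in $M$) approximations of $K(\theta, \xi)$ and $\nu(\xi)$, respectively,
by the usual Law of Large Numbers and, therefore, suggest using the
approximation

$$\frac{\nu(\xi)}{K(\theta, \xi)} \approx
\frac{\sum_{i=1}^{M} \varphi(\xi \mid \tilde\eta_1^i, \tilde\eta_2^i, \tilde\eta_3^i)}
     {\sum_{i=1}^{M} \varphi(\xi \mid \eta_1^i, \eta_2^i, \eta_3^i)}.$$

The variables $\eta_j^i$ must be simulated at every iteration of the Gibbs sampler,
whereas the variables $\tilde\eta_j^i$ can be simulated only once when starting the
algorithm.

Table 12.1 gives the performances of the diagnostic for different values
of $r$, when $C = [x_2 - r, x_2 + r]$. The decrease of the bound $\varepsilon$ as a function
of $r$ is quite slow and indicates a high renewal rate. It is thus equal to 11%
for $r = 0.54$. The stabilizing of the variance $\gamma_h^2$ occurs around 1160 for
$h(x) = x$, but the number of simulations before stabilizing is huge, since one
million iterations do not provide convergence to the same quantity, except for
values of $r$ between 0.32 and 0.54. ||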



Mykland et al. (1995) provide a simple and clever way of avoiding the
residual kernel altogether. Suppose that $\theta^{(t)} \in C$. Instead of drawing $\delta_t \sim \mathcal{B}(\varepsilon)$
and then simulating $\theta^{(t+1)}$ conditional on $(\delta_t, \theta^{(t)})$ (which entails simulation
from $R(\theta^{(t)}, \cdot)$ if $\delta_t = 0$), one first simulates $\theta^{(t+1)}$ conditional on $\theta^{(t)}$ (in the
usual way) and then simulates $\delta_t$ conditional on $(\theta^{(t)}, \theta^{(t+1)})$. Calculation of
$\Pr(\delta_t = 1 \mid \theta^{(t)}, \theta^{(t+1)})$ is straightforward if the densities $K$ and $\nu$ are known
(Problem 12.11).

    r             0.1    0.21   0.32   0.43   0.54   0.65   0.76   0.87   0.98   1.09
    $\varepsilon$           0.92   0.83   0.73   0.63   0.53   0.45   0.38   0.31   0.26   0.22
    $\mu_C$         25.3   13.9   10.5   9.6    8.8    9.6    9.8    10.4   11.4   12.7
    $\gamma_h^2$    1135   1138   1162   1159   1162   1195   1199   1149   1109   1109

Table 12.1. Estimation of the asymptotic variance $\gamma_h^2$ for $h(\theta) = \theta$ and renewal
control for $C = [x_2 - r, x_2 + r]$ with $x_1 = -8$, $x_2 = 8$, $x_3 = 17$, and $\sigma^2 = 50$ (1
million simulations).
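The device of Mykland et al. (1995) can be sketched in a finite toy setting, where $\Pr(\delta_t = 1 \mid \theta^{(t)}, \theta^{(t+1)}) = \varepsilon\nu(\theta^{(t+1)})/K(\theta^{(t)}, \theta^{(t+1)})$ is available exactly. The 3-state matrix `P`, the use of the whole state space as small set, and the helper name `renewal_probability` are assumptions for this illustration, not part of the original method description.

```python
import numpy as np

# Hypothetical 3-state transition matrix, used only for illustration.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])

# Assumed minorization on the whole space: eps * nu(j) = min_i P[i, j].
eps_nu = P.min(axis=0)
eps = eps_nu.sum()


def renewal_probability(i, j):
    """Pr(delta_t = 1 | theta_t = i, theta_{t+1} = j) = eps*nu(j) / K(i, j)."""
    return eps_nu[j] / P[i, j]


rng = np.random.default_rng(1)
T = 10_000
theta = 0
delta = np.empty(T, dtype=int)
for t in range(T):
    new = rng.choice(3, p=P[theta])                 # move the chain as usual
    delta[t] = rng.random() < renewal_probability(theta, new)  # then draw delta_t
    theta = new

renewal_rate = delta.mean()   # should be close to eps in this toy setting
```

No draw from the residual kernel is needed: the chain is simulated from its ordinary transition, and the renewal indicator is attached afterwards.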

The previous example shows how conservative this method can be and,
thus, how many iterations it requires. In addition, the method does not
guarantee convergence since the small sets can always omit important parts of the
support of $f$.

A particular setup where the renewal method behaves quite satisfactorily is
the case of Markov chains with finite state-space. In fact, the choice of a small
set is then immediate: If $\Theta$ can be written as $\{i : i \in I\}$, $C = \{i_0\}$ is a small
set, with $i_0$ the modal state under the stationary distribution $\pi = (\pi_i)_{i \in I}$. If
$P = (p_{ij})$ denotes the transition matrix of the chain,

$$C = \{i_0\}, \qquad \varepsilon = 1, \qquad \nu = (p_{i_0 i})_{i \in I},$$

and renewal occurs at each visit in $i_0$. Considering in parallel other likely
states under $\pi$, we then immediately derive a convergence indicator.

Example 12.17. A finite Markov chain. Consider $\Theta = \{0, 1, 2, 3\}$ and

$$P = \begin{pmatrix}
0.26 & 0.04 & 0.08 & 0.62 \\
0.05 & 0.24 & 0.03 & 0.68 \\
0.11 & 0.10 & 0.08 & 0.71 \\
0.08 & 0.04 & 0.09 & 0.79
\end{pmatrix}.$$

The stationary distribution is then $\pi = (0.097, 0.056, 0.085, 0.762)$, with mean
2.51. Table 12.2 compares estimators of $\gamma_h^2$, with $h(\theta) = \theta$, for the four small
sets $C = \{i_0\}$ and shows that stabilizing is achieved for $T = 500{,}000$. ||
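The quantities in this example can be checked numerically. The sketch below computes the stationary distribution of the matrix above, the exact asymptotic variance of the empirical mean of $h(\theta) = \theta$ through the fundamental matrix $(I - P + \mathbf{1}\pi^{T})^{-1}$, and a tour-based estimate using returns to a state $i_0$. The tour-based formula is the standard regenerative estimator, taken here as a stand-in for the renewal estimator of the text; the function names, run length, and seed are assumptions.

```python
import numpy as np
from bisect import bisect_right

P = np.array([[0.26, 0.04, 0.08, 0.62],
              [0.05, 0.24, 0.03, 0.68],
              [0.11, 0.10, 0.08, 0.71],
              [0.08, 0.04, 0.09, 0.79]])
h = np.arange(4.0)                      # h(theta) = theta


def stationary(P):
    """Solve pi P = pi together with sum(pi) = 1 by least squares."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]


def asymptotic_variance(P, pi, h):
    """Exact limiting variance of sqrt(T) * (empirical mean of h)."""
    n = P.shape[0]
    hc = h - pi @ h
    Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), pi))
    return 2.0 * np.sum(pi * hc * (Z @ hc)) - np.sum(pi * hc ** 2)


def simulate(P, T, x0, rng):
    """Simulate T steps of the chain by inverting each row's cdf."""
    cum = [list(np.cumsum(row)) for row in P]
    u = rng.random(T)
    chain = np.empty(T, dtype=int)
    x = x0
    for t in range(T):
        x = min(bisect_right(cum[x], u[t]), len(cum[x]) - 1)
        chain[t] = x
    return chain


def renewal_variance(chain, i0, h):
    """Regenerative estimate of the variance, from tours between visits to i0."""
    visits = np.flatnonzero(chain == i0)
    csum = np.concatenate([[0.0], np.cumsum(h[chain])])
    tau = np.diff(visits)                            # tour lengths
    S = csum[visits[1:]] - csum[visits[:-1]]         # sums of h over each tour
    m = S.sum() / tau.sum()
    return np.sum((S - m * tau) ** 2) / tau.sum()


pi = stationary(P)
gamma2_exact = asymptotic_variance(P, pi, h)
chain = simulate(P, 200_000, 3, np.random.default_rng(0))
gamma2_renewal = renewal_variance(chain, 0, h)
```

Changing the second argument of `renewal_variance` to 1, 2, or 3 gives the estimates based on the other candidate small sets, which is the comparison the example describes.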



This academic example illustrates how easily the criterion can be used in
finite environments. Finite state-spaces are, nonetheless, far from trivial since,
as shown by the Duality Principle (Section 9.2.3), one can use the simplicity
of some subchains to verify convergence for the Gibbs sampler. Therefore,
renewal can be created for a (possibly artificial) subchain $(z^{(t)})$ and later
applied to the parameter of interest if the Duality Principle applies. (See also
Chauveau et al. 1998 for a related use of the Central Limit Theorem for
convergence assessment.)

    T \ i_0      0        1        2        3
    5000         1.19     1.29     1.26     1.21
    500,000      1.344    1.335    1.340    1.343

Table 12.2. Estimators of $\gamma_h^2$ for $h(x) = x$, obtained by renewal in $i_0$.

Example 12.18. Grouped multinomial model. The Gibbs sampler
corresponding to the grouped multinomial model of Example 9.8,

$$X \sim \mathcal{M}_5\left(n;\, a_1\mu + b_1,\, a_2\mu + b_2,\, a_3\eta + b_3,\, a_4\eta + b_4,\, c(1 - \mu - \eta)\right),$$

actually enjoys a data augmentation structure and satisfies the conditions
leading to the Duality Principle. Moreover, the missing data $(z_1, \ldots, z_4)$ has
finite support since

$$z_i \sim \mathcal{B}\left(x_i, \frac{a_i\mu}{a_i\mu + b_i}\right) \quad (i = 1, 2), \qquad
z_i \sim \mathcal{B}\left(x_i, \frac{a_i\eta}{a_i\eta + b_i}\right) \quad (i = 3, 4).$$

Therefore, the support of $(z^{(t)})$ is of size $(x_1 + 1) \times \cdots \times (x_4 + 1)$. A preliminary
simulation indicates that (0, 1, 0, 0) is the modal state for the chain $(z^{(t)})$, with
an average excursion time of 27.1. The second most frequent state is (0, 2, 0, 0),
with a corresponding average excursion time of 28.1. In order to evaluate
the performances of renewal control, we also used the state (1, 1, 0, 0), which
appears, on average, every 49.1 iterations. Table 12.3 describes the asymptotic
variances obtained for the functions

$$h_1(\mu, \eta) = \mu - \eta, \qquad h_2(\mu, \eta) = I_{\mu > \eta}, \qquad h_3(\mu, \eta) = \frac{\mu}{1 - \mu - \eta}.$$

The estimator of $h_3$ is obtained by Rao-Blackwellization,

$$\frac{1}{T} \sum_{t=1}^{T} \frac{0.5 + z_1^{(t)} + z_2^{(t)}}{x_5 - 0.5},$$

which increases the stability of the estimator and reduces its variance.

The results thus obtained are in good agreement for the three states under
study, even though they cannot rigorously establish convergence. ||
    h_i      E[h_i(mu, eta) | x]     gamma_h^2 (1)    gamma_h^2 (2)    gamma_h^2 (3)
    h_1      5 x 10^{-4}             0.758            0.720            0.789
    h_2      0.496                   1.24             1.21             1.25
    h_3      0.739                   1.45             1.41             1.67

Table 12.3. Approximations by Gibbs sampling of posterior expectations and
evaluation of variances $\gamma_h^2$ by the renewal method for three different states (1 for
(0, 1, 0, 0), 2 for (0, 2, 0, 0), and 3 for (1, 1, 0, 0)) in the grouped multinomial model
(500,000 iterations).



12.3.4 Within and Between Variances



The control strategy devised by Gelman and Rubin (1992) starts with the
derivation of a distribution $\mu$ related to the modes of $f$, which are
supposedly obtained by numerical methods. They suggest using a mixture of
Student's t distributions centered around the identified modes of $f$, the scale
being derived from the second derivatives of $f$ at these modes. With the
possible addition of an importance sampling step (see Problem 3.18), they then
generate $M$ chains $(\theta_m^{(t)})$ $(1 \le m \le M)$. For every quantity of interest $\xi = h(\theta)$,
the stopping rule is based on the difference between a weighted estimator of
the variance and the variance of estimators from the different chains.

Define

$$B_T = \frac{1}{M} \sum_{m=1}^{M} (\bar\xi_m - \bar\xi)^2,$$

$$W_T = \frac{1}{M} \sum_{m=1}^{M} s_m^2 = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{T} \sum_{t=1}^{T} (\xi_m^{(t)} - \bar\xi_m)^2,$$

with

$$\bar\xi_m = \frac{1}{T} \sum_{t=1}^{T} \xi_m^{(t)}, \qquad \bar\xi = \frac{1}{M} \sum_{m=1}^{M} \bar\xi_m,$$

and $\xi_m^{(t)} = h(\theta_m^{(t)})$. The quantities $B_T$ and $W_T$ represent the between- and
within-chain variances. A first estimator of the posterior variance of $\xi$ is

$$\hat\sigma_T^2 = \frac{T - 1}{T}\, W_T + B_T.$$

Gelman and Rubin (1992) compare $\hat\sigma_T^2$ and $W_T$, which are asymptotically
equivalent (Problem 12.15). Gelman (1996) notes that $\hat\sigma_T^2$ overestimates the
variance of $\xi_m^{(t)}$ because of the large dispersion of the initial distribution,
whereas $W_T$ underestimates this variance, as long as the different sequences
$(\xi_m^{(t)})$ remain concentrated around their starting values.

The recommended criterion of Gelman and Rubin (1992) is to monitor

$$R_T = \frac{\hat\sigma_T^2 + B_T/M}{W_T}\, \frac{\nu_T}{\nu_T - 2}
      = \left(\frac{T - 1}{T} + \frac{M + 1}{M}\, \frac{B_T}{W_T}\right) \frac{\nu_T}{\nu_T - 2},$$

where $\nu_T = 2(\hat\sigma_T^2 + B_T/M)^2 / W_T$ and the approximate distribution of $R_T$ is
derived from the approximation $T B_T / W_T \sim \mathcal{F}(M - 1, 2W_T^2/\varpi_T)$ with

$$\varpi_T = \frac{1}{M^2} \left[\sum_{m=1}^{M} s_m^4 - \frac{1}{M} \left(\sum_{m=1}^{M} s_m^2\right)^2\right].$$

(The approximation ignores the variability due to $\nu_T/(\nu_T - 2)$.) The
stopping rule is based either on testing that the mean of $R_T$ is equal to 1 or on
confidence intervals on $R_T$.

Fig. 12.21. Evolutions of $R_T$ (solid lines and scale on the left) and of $W_T$ (dotted
lines and scale on the right) for the posterior distribution (12.17) and $h(\theta) = \theta$
($M = 100$).
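A minimal sketch of the between/within computation may help fix the notation. It follows the normalizations used in this section (divisors $M$ and $T$, and $B_T$ not multiplied by $T$), and it returns the ratio $(\hat\sigma_T^2 + B_T/M)/W_T$ without the degrees-of-freedom factor $\nu_T/(\nu_T - 2)$. The function name `gelman_rubin`, the array layout, and the simulated inputs are assumptions for illustration only.

```python
import numpy as np


def gelman_rubin(xi):
    """Between/within variances for xi[m, t] = h(theta_m^(t)), shape (M, T).

    Returns (B, W, sigma2, R), where R omits the factor nu_T / (nu_T - 2).
    """
    xi = np.asarray(xi, dtype=float)
    M, T = xi.shape
    chain_means = xi.mean(axis=1)
    B = np.mean((chain_means - chain_means.mean()) ** 2)   # between-chain
    W = xi.var(axis=1).mean()                              # within-chain, divisor T
    sigma2 = (T - 1) / T * W + B
    R = (sigma2 + B / M) / W
    return B, W, sigma2, R


rng = np.random.default_rng(0)
M, T = 10, 2000

# Chains that all sample the same N(0, 1) distribution: R should be near 1.
mixed = rng.normal(size=(M, T))
R_mixed = gelman_rubin(mixed)[3]

# Chains stuck around different locations: R should be far above 1.
stuck = mixed + np.arange(M)[:, None]
R_stuck = gelman_rubin(stuck)[3]
```

The second case mimics chains that remain concentrated around well-separated starting values, which is the situation where $W_T$ underestimates the variance and the ratio stays large.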



Example 12.19. (Continuation of Example 12.12) Figure 12.21
describes the evolution of $R_T$ for $h(\theta) = \theta$ and $M = 100$ and 1000 iterations. As
the scale of the graph of $R_T$ is quite compressed, one can conclude there is
convergence after about 600 iterations. (The first 50 iterations have been
eliminated.) On the other hand, the graph of $W_T$ superimposed in this figure does
not exhibit the same stationarity after 1000 iterations. However, the study
of the associated histogram (see Figure 12.18) shows that the distribution of
$\theta^{(t)}$ is stationary after a few hundred iterations. In this particular case, the
criterion is therefore conservative, showing that the method can be difficult
to calibrate. ||



This method has enjoyed wide usage, in particular because of its simplicity 
and its connections with the standard tools of linear regression. However, we 
point out the following: 
(a) Gelman and Rubin (1992) also suggest removing the first half of the
simulated sample to reduce the dependence on the initial distribution $\mu$. By
comparison with a single-chain method, the number of wasted simulations
is thus (formally) multiplied by $M$.

(b) The accurate construction of the initial distribution $\mu$ can be quite delicate
and time-consuming. Also, in some models, the number of modes is too
great to allow for a complete identification and important modes may be
missed.

(c) The method relies on normal approximations, whereas the MCMC algo- 
rithms are used in settings where these approximations are, at best, diffi- 
cult to satisfy and, at worst, not valid. The use of Student’s t distributions 
by Gelman and Rubin (1992) does not remedy this. More importantly, 
there is no embedded test for the validity of this approximation. 

(d) The criterion is unidimensional; therefore, it gives a poor evaluation of 
the correlation between variables and the necessary slower convergence of 
the joint distribution. Moreover, the stopping rule must be modified for 
every function of interest, with very limited recycling of results obtained 
for other functions. (Brooks and Gelman 1998a studied multidimensional 
extensions of this criterion based on the same approximations and Brooks 
and Giudici 2000 introduced a similar method in the special case of re- 
versible jump algorithms.) 

12.3.5 Effective Sample Size 

Even in a stationary setting, there is an obvious difference in the use of the
empirical average when compared with the standard Monte Carlo estimate of
Chapter 3. Indeed, using the empirical average (12.13) as the estimator of

$$\int h(\theta) f(\theta)\, d\theta,$$

we cannot associate the standard variance estimator

T 

( 12 . 22 ) Ot = - Srf 

t=l 

to this estimator, due to the correlations amongst the $\theta^{(t)}$'s. As mentioned
above for the Gelman and Rubin (1992) criterion, (12.22) corresponds to the
“within” variance in Section 12.3.4, and underestimates the true variance of
the estimator $S_T$. A first and obvious solution is to use batch sampling as in
Section 12.2.2, but this is costly and the batch size still has to be determined.
A more standard, if still approximate, approach is to resort to the effective
sample size, $T^S$, which gives the size of an iid sample with the same variance
as the current sample and thus indicates the loss in efficiency due to the use
of a Markov chain. This value is computed as

$$T^S = T/\kappa(h),$$

where $\kappa(h)$ is the autocorrelation time associated with the sequence $h(\theta^{(t)})$,

$$\kappa(h) = 1 + 2 \sum_{t=1}^{\infty} \mathrm{corr}\left(h(\theta^{(0)}), h(\theta^{(t)})\right).$$

Replacing $T$ with $T^S$ then leads to a more reliable estimate of the variance.
Estimating $\kappa(h)$ is also a delicate issue, but there exist some approximations
in the literature, as discussed in Section 12.6.1. Also, the software CODA
(Section 12.6.2) contains a procedure that computes the autocorrelation. This
notion of effective sample size can be found in Liu and Chen (1995) in the
special case of importance resampling (Section 14.3.1).
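As a rough numerical illustration of the effective sample size, the sketch below estimates the autocorrelation time from empirical autocorrelations and applies it to an AR(1) sequence with coefficient 0.5, for which the autocorrelation time is known to be $(1 + 0.5)/(1 - 0.5) = 3$. The truncation rule (stop at the first non-positive empirical autocorrelation), the cap `max_lag`, and the name `autocorr_time` are assumptions of this sketch, not a procedure prescribed by the text.

```python
import numpy as np


def autocorr_time(x, max_lag=200):
    """Estimate kappa = 1 + 2 * sum of autocorrelations.

    Assumed truncation rule: stop at the first non-positive estimate.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    var = np.dot(xc, xc) / n
    kappa = 1.0
    for lag in range(1, max_lag + 1):
        rho = np.dot(xc[:-lag], xc[lag:]) / (n * var)
        if rho <= 0:
            break
        kappa += 2.0 * rho
    return kappa


# AR(1) test sequence: x_t = rho * x_{t-1} + noise_t
rng = np.random.default_rng(0)
T, rho = 100_000, 0.5
noise = rng.normal(size=T)
x = np.empty(T)
x[0] = noise[0]
for t in range(1, T):
    x[t] = rho * x[t - 1] + noise[t]

kappa = autocorr_time(x)
ess = T / kappa        # effective sample size
```

Here roughly one third of the nominal sample size survives, which quantifies the efficiency loss due to the dependence in the sequence.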



12.4 Simultaneous Monitoring 

12.4.1 Binary Control 

Raftery and Lewis (1992) (see also Raftery and Banfield 1991) attempt to
reduce the study of the convergence of the chain $(\theta^{(t)})$ to the study of the
convergence of a two-state Markov chain, where an explicit analysis is possible.
They then evaluate three parameters for the control of convergence, namely,
$k$, the minimum “batch” (subsampling) step, $t_0$, the number of “warm-up”
iterations necessary to achieve stationarity (to eliminate the effect of the
starting value), and $T$, total number of iterations “ensuring” convergence (giving a
chosen precision on the empirical average). The control is understood to be at
the level of the derived two-state Markov chain (rather than for the original
chain of interest).

The binary structure at the basis of the diagnosis is derived from the chain
$\theta^{(t)}$ by

$$Z^{(t)} = I_{\theta^{(t)} \le \underline{\theta}},$$

where $\underline{\theta}$ is an arbitrary value in the support of $f$. Unfortunately, the sequence
$(Z^{(t)})$ does not form a Markov chain, even in the case where $\theta^{(t)}$ has a finite
support (see Problem 6.56). Raftery and Lewis (1992) determined the batch
size $k$ by testing if $(Z^{(kt)})$ is a Markov chain against the alternative that $(Z^{(kt)})$
is a second order Markov chain (that is, that the vector $(Z^{(kt)}, Z^{(k(t+1))})$ is
a Markov chain). This determination of $k$ therefore has limited theoretical
foundations and it makes sense, when accounting for the efficiency loss detailed
in Lemma 12.2, to suggest working with the complete sequence $Z^{(t)}$ and not
trying to justify the Markov approximation.
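If $(Z^{(t)})$ is treated as a homogeneous Markov chain, with (pseudo-)
transition matrix

$$P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix},$$

it converges to the stationary distribution (Problem 6.51)

$$P(Z^{(\infty)} = 0) = \frac{\beta}{\alpha + \beta}, \qquad P(Z^{(\infty)} = 1) = \frac{\alpha}{\alpha + \beta}.$$

It is therefore possible to determine the warm-up size by requiring that

(12.23)    $$\left| P(Z^{(t_0)} = i \mid Z^{(0)} = j) - P(Z^{(\infty)} = i) \right| < \varepsilon$$

for $i, j = 0, 1$. Raftery and Lewis (1992) show that this is equivalent to
(Problem 12.16)

(12.24)    $$t_0 \ge \log\left(\frac{(\alpha + \beta)\varepsilon}{\alpha \vee \beta}\right) \Big/ \log|1 - \alpha - \beta|.$$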

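The sample size related with the (acceptable) convergence of

$$\delta_T = \frac{1}{T} \sum_{t=t_0}^{t_0+T} Z^{(t)}$$

to $\alpha/(\alpha + \beta)$ can be determined by a normal approximation of $\delta_T$, with variance

$$\frac{1}{T}\, \frac{(2 - \alpha - \beta)\, \alpha\beta}{(\alpha + \beta)^3}.$$

If, for instance, we require

$$P\left(\left|\delta_T - \frac{\alpha}{\alpha + \beta}\right| < q\right) \ge \varepsilon',$$

the value of $T$ is (Problem 12.18)

(12.25)    $$T \ge \frac{\alpha\beta(2 - \alpha - \beta)}{q^2 (\alpha + \beta)^3}\, \Phi^{-1}\left(\frac{\varepsilon' + 1}{2}\right)^2.$$

This analysis relies on knowledge of the parameters $(\alpha, \beta)$. These are
unknown in most settings of interest and must be estimated from the simulation
of a test sample. Based on the independent case, Raftery and Lewis (1992)
suggest using a sample size which is at least

$$T_{\min} \simeq \Phi^{-1}\left(\frac{\varepsilon' + 1}{2}\right)^2 \frac{\alpha\beta}{(\alpha + \beta)^2}\, q^{-2}.$$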
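Formulas (12.24) and (12.25) are straightforward to evaluate once $(\alpha, \beta)$ are given. The sketch below is one way to do so; the function names are assumptions, and the inputs are the rounded preliminary estimates $\alpha_0 = 0.0425$ and $\beta_0 = 0.0294$ quoted in Example 12.20 below, with $\varepsilon = q = 0.01$ and $\varepsilon' = 0.99$.

```python
from math import ceil, log
from statistics import NormalDist


def warmup_size(alpha, beta, eps):
    """Smallest integer t0 satisfying (12.24)."""
    bound = log((alpha + beta) * eps / max(alpha, beta)) / log(abs(1 - alpha - beta))
    return ceil(bound)


def sample_size(alpha, beta, q, eps_prime):
    """Smallest integer T satisfying (12.25)."""
    z = NormalDist().inv_cdf((eps_prime + 1) / 2)   # standard normal quantile
    return ceil(alpha * beta * (2 - alpha - beta) / (q ** 2 * (alpha + beta) ** 3) * z ** 2)


t0 = warmup_size(0.0425, 0.0294, 0.01)
T = sample_size(0.0425, 0.0294, 0.01, 0.99)
```

With these rounded inputs the warm-up size is 55 and $T$ is about 430,000, close to the 430,594 reported in Example 12.20; the small gap is presumably due to the rounding of $\alpha_0$ and $\beta_0$.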
    Round    t_0     T (x 10^3)    alpha    beta
    1        55      431           0.041    0.027
    2        85      452           0.040    0.001
    3        111     470           0.041    0.029
    4        56      442           0.041    0.029
    5        56      448           0.040    0.027
    6        58      458           0.041    0.025
    7        50      455           0.042    0.025
    8        50      452           0.041    0.028

Table 12.4. Evolution of initializing and convergence times, and parameters
$(\alpha, \beta)$ estimated after $t_0 + T$ iterations obtained from the previous round, in the
$\mathcal{B}e(\alpha + 1, 1)$ example.

This method, called binary control, is quite popular, in particular because
programs are available (in Statlib) and also because its implementation is
quite easy.^ However, there are drawbacks to using it as a convergence
indicator:

(a) The preliminary estimation of the coefficients $\alpha$ and $\beta$ requires a chain
$(\theta^{(t)})$ which is already (almost) stationary and which has, we hope,
sufficiently explored the characteristics of the distribution $f$. If $\alpha$ and $\beta$ are
not correctly estimated, the validity of the method vanishes.

(b) The approach of Raftery and Lewis (1992) is intrinsically unidimensional
and, hence, does not assess the correlations between components. It can
thus conclude there is convergence, based on the marginal distributions,
whereas the joint distribution is not correctly estimated.

(c) Once $\alpha$ and $\beta$ are estimated, the stopping rules are independent of the
model under study and of the selected MCMC algorithm, as shown by
formulas (12.24) and (12.25). This generic feature is appealing for an
automated implementation, but it cannot guarantee global efficiency.

Example 12.20. (Continuation of Example 12.13) For the chain with
stationary distribution $\mathcal{B}e(0.2, 1)$, the probability that $X^{0.8}$ is less than 0.2,
with $E[X^{0.8}] = 0.2$, can be approximated from the Markov chain and the
two-state (pseudo-)chain is directly derived from

$$Z^{(t)} = I_{(X^{(t)})^{0.8} < 0.2}.$$

Based on a preliminary sample of size 50,000, the initial values of $\alpha$ and $\beta$
are $\alpha_0 = 0.0425$ and $\beta_0 = 0.0294$. These approximations lead to $t_0 = 55$ and
$T = 430{,}594$ for $\varepsilon = q = 0.01$ and $\varepsilon' = 0.99$. If we run the algorithm for $t_0 + T$
iterations, the estimates of $\alpha$ and $\beta$ are then $\alpha^* = 0.0414$ and $\beta^* = 0.027$.
Repeating this recursive evaluation of $(t_0, T)$ as a function of $(\alpha, \beta)$ and the
update of $(\alpha, \beta)$ after $(t_0 + T)$ additional iterations, we obtain the results in
Table 12.4. These exhibit a relative stability in the evaluation of $(\alpha, \beta)$ (except
for the second iteration) and, thus, of the corresponding $(t_0, T)$'s. However,
as shown by Figure 12.22, after 4,500,000 iterations, the approximation of
$P(X^{0.8} \le 0.2)$ is equal to 0.64, while the true value is $0.2^{0.2/0.8}$ (that is, 0.669).
This erroneous convergence is not indicated by the binary control method in
this pathological setup. ||

^ In the example below, the batch size $k$ is fixed at 1 and the method has been
directly implemented in C, instead of calling the Statlib program.

Fig. 12.22. Convergence of the mean Qt for a chain generated from [A.30].



12.4.2 Valid Discretization 

A fundamental problem with the binary control technique of Raftery and
Lewis (1992) in Section 12.4.1 is that it relies on an approximation, namely
that the discretized sequence $(z^{(t)})$ is a Markov chain. Guihenneuc-Jouyaux
and Robert (1998) have shown that there exists a rigorous discretization of
Markov chains which produces Markov chains. The idea at the basis of this
discretization is to use several disjoint small sets $A_i$ $(i = 1, \ldots, k)$ for the
chain $(\theta^{(t)})$, with corresponding parameters $(\varepsilon_i, \nu_i)$, and to subsample only at
renewal times $\tau_n$ $(n > 1)$. The $\tau_n$'s are defined as the successive instants when
the Markov chain enters one of these small sets with splitting, that is, by

$$\tau_n = \inf\left\{ t > \tau_{n-1};\ \exists\, 1 \le i \le k,\ \theta^{(t-1)} \in A_i \text{ and } \theta^{(t)} \sim \nu_i \right\}.$$

(Note that the $A_i$'s $(i = 1, \ldots, k)$ need not be a partition of the space.)

The discretized Markov chain is then derived from the finite-valued
sequence

$$\eta^{(t)} = \sum_{i=1}^{k} i\, I_{A_i}(\theta^{(t)})$$

as $(\xi^{(n)}) = (\eta^{(\tau_n - 1)})$. The resulting chain is then described by the sequence of
small sets encountered by the original chain $(\theta^{(t)})$ at renewal times. It can be
shown that the sequence $(\xi^{(n)})$ is a homogeneous Markov chain on the finite
state-space $\{1, \ldots, k\}$ (Problem 12.21).

Fig. 12.23. Discretization of a continuous Markov chain, based on three small sets.
The renewal events are represented by triangles for $B$, circles for $C$, and squares for
$D$, respectively. (Source: Guihenneuc-Jouyaux and Robert 1998.)

Example 12.21. (Continuation of Example 12.12) Figure 12.23
illustrates discretization for the subchain $(\theta^{(t)})$, with three small sets $C = [7.5, 8.5]$,
derived in Example 12.15, and $B = [-8.5, -7.5]$ and $D = [17.5, 18.5]$, which
can be constructed the same way. Although the chain visits the three sets
quite often, renewal occurs with a much smaller frequency, as shown by the
symbols on the sequence. ||

This result justifies control of Markov chains through their discretized 
counterparts. Guihenneuc-Jouyaux and Robert (1998) propose further uses of 
the discretized chain, including the evaluation of mixing rates. (See Cowles and 
Rosenthal 1998 for a different approach to this evaluation, based on the drift 
conditions of Note 6.9.1, following the theoretical developments of Rosenthal 
1995.) 

12.5 Problems 



12.1 The witch's hat distribution

$$\pi(\theta \mid y) \propto \left\{ (1 - \delta)\, \sigma^{-d} e^{-\|y - \theta\|^2/(2\sigma^2)} + \delta \right\} I_C(\theta), \qquad y \in \mathbb{R}^d,$$

when $\theta$ is restricted to the unit cube $C = [0, 1]^d$, has been proposed by Matthews
(1993) as a calibration benchmark for MCMC algorithms.

(a) Construct an algorithm which correctly simulates the witch's hat
distribution. (Hint: Show that direct simulation is possible.)

(b) The choice of $\delta$, $\sigma$, and $d$ can lead to arbitrarily small probabilities of either
escaping the attraction of the mode or reaching the mode. Find sets of
parameters $(\delta, \sigma, y)$ for which these two phenomena occur.

12.2 Establish that the cdf of the statistic (12.2) is (12.3). 

12.3 Reproduce the experiment of Example 12.10 in the cases a = 0.9 and a = 0.99. 

12.4 In the setup of Example 12.10:

(a) Show that $E[X^{1-\alpha}] = \alpha$ when $X \sim \mathcal{B}e(\alpha, 1)$.

(b) Show that $P(X^{1-\alpha} < \varepsilon) = \varepsilon^{\alpha/(1-\alpha)}$.

(c) Show that the Riemann approximation $S_T^R$ of (12.16) has a very specific
shape in this case, namely that it is equal to the normalizing constant of $f$.

12.5 (Liu et al. 1992) Show that if (^f^) and (^ 2 ^^) are independent Markov chains 

with transition kernel K and stationary distribution tt, 

is an unbiased estimator of the L 2 distance between tt and 

= j K{r],0)Tr^~^{'n)dr] , 



in the sense that E[u;t] = 1 + var 



n{0) ■ 



12.6 (Roberts 1994) Show that if (^i*^) and (^ 2 *^) are independent Markov chains 
with transition kernel K and stationary distribution tt. 



t 



X 






satisfies 

(a) E[x‘] = / 1, 

(b) limt-^oo var(xt) = f{K(e‘f’\e) - 7r(6>))^^d6> , 

where irj denotes the density of {Hint: Use the equilibrium relation 

(2t-l) ^(2t-2)x 



to prove part (a).) 






12.7 (Continuation of Problem 12.6) Define 

J ^ ^ ^ Xij 5 ^ ^ Xii ? 

m{m — 1) ^ 

with 






based on m parallel chains (^^^^) {j = 1, . . . ,m). 
(a) Show that if $\theta_j^{(0)} \sim \pi$, $E[I_t] = E[J_t] = 1$ and that for every initial distribution
on the $\theta_j^{(0)}$'s, $E[I_t] \le E[J_t]$.

(b) Show that

$$\mathrm{var}(I_t) \le \mathrm{var}(J_t)/(m - 1).$$

12.8 (Dellaportas 1995) Show that 



9{x)J 



-I 



= / \f(x) ~g{x)\dx . 



Derive an estimator of the Li distance between the stationary distribution tt 
and the distribution at time t, 7^^ 

12.9 (Robert 1995a) Give the marginal distribution of $\xi = h_3(\mu, \eta)$ in Example
12.18 by doing the following:

(i) Show that the Jacobian of the transform of $(\mu, \eta)$ in $(\xi, \eta)$ is $(1 - \eta)/(1 + \xi)^2$.

(ii) Express the marginal density of $\xi$ as a $(x_1 + x_2)$ degree polynomial in $\xi/(1 + \xi)$.

(iii) Show that the weights $w_j$ of this polynomial can be obtained, up to a
multiplicative factor, in closed form.

Deduce that a Riemann sum estimate of $E[h_3(\mu, \eta)]$ is available in this case.

12.10 For the model of Example 12.18, show that a small set is available in the
$(\mu, \eta)$ space and derive the corresponding renewal probability.

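12.11 In the setup of Section 12.3.3, if $\delta_t$ is the $\mathcal{B}(\varepsilon)$ random variable associated
with renewal for the minorization condition (12.18), show that the distribution
of $\delta_t$ conditional on $(\theta^{(t)}, \theta^{(t+1)})$ is

$$\Pr(\delta_t = 1 \mid \theta^{(t)}, \theta^{(t+1)}) \propto \varepsilon\nu(\theta^{(t+1)}),$$

$$\Pr(\delta_t = 0 \mid \theta^{(t)}, \theta^{(t+1)}) \propto K(\theta^{(t)}, \theta^{(t+1)}) - \varepsilon\nu(\theta^{(t+1)}).$$

12.12 Given a mixture distribution $p\,\mathcal{N}(\mu, 1) + (1 - p)\,\mathcal{N}(\theta, 1)$, with conjugate priors

$$p \sim \mathcal{U}_{[0,1]}, \qquad \mu \sim \mathcal{N}(0, \tau^2), \qquad \theta \sim \mathcal{N}(0, \tau^2),$$

show that $[\underline{p}, \bar{p}] \times [\underline{\mu}, \bar{\mu}] \times [\underline{\theta}, \bar{\theta}]$ is a small set and deduce that $\varepsilon$ in (12.18) decreases
as a power of the sample size.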

12.13 For the estimator given in Example 12.13: 

(a) Show that the variance of in Example 12.13 is infinite. 

(b) Propose an estimator of the variance of (when it exists) and derive a 
convergence diagnostic. 

(c) Check whether the importance sampling is (i) available and (ii) with 
finite variance for Examples 10.17 and 9.8. 

12.14 (a) Verify the decomposition

$$\mathrm{var}(X) = \mathrm{var}(E[X \mid Y]) + E[\mathrm{var}(X \mid Y)].$$

(b) Show that, up to second order, and behave as in the independent 
case. That is, varS'^ Ylt 'Wtysnh{9^^'>) + 0(l/t). 

12.15 Referring to Section 12.3.4:

(a) Show that $\hat\sigma_T^2$ and $W_T$ have the same expectation.

(b) Show that $\hat\sigma_T^2$ and $W_T$ are both asymptotically normal. Find the limiting
variances.

12.16 (Raftery and Lewis 1992) Establish the equivalence between (12.23) and
(12.24) by showing first that it is equivalent to

$$|1 - \alpha - \beta|^{t_0} \le \frac{(\alpha + \beta)\varepsilon}{\alpha \vee \beta}.$$

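12.17 For the transition matrix

$$P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix},$$

show that the second eigenvalue is $\lambda = 1 - \alpha - \beta$. Deduce that

$$\left| P(x^{(t)} = i \mid x^{(0)} = j) - P(x^{(\infty)} = i) \right| < \varepsilon$$

for every $(i, j)$ if $|1 - \alpha - \beta|^t \le \varepsilon(\alpha + \beta)/\max(\alpha, \beta)$. (Hint: Use the representation
$P(x^{(t)} = i \mid x^{(0)} = j) = e_j P^t e_i'$, with $e_0 = (1, 0)$ and $e_1 = (0, 1)$.)

12.18 (Raftery and Lewis 1992) Deduce (12.25) from the normal approximation of
$\delta_T$. (Hint: Show that

$$\Phi\left( \frac{\sqrt{T}\, q\, (\alpha + \beta)^{3/2}}{\sqrt{\alpha\beta(2 - \alpha - \beta)}} \right) \ge \frac{\varepsilon' + 1}{2}.)$$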

12.19 (Raftery and Lewis 1996) Consider the logit model

$$\log \frac{\pi_i}{1 - \pi_i} = \eta + \delta_i, \qquad \delta_i \sim \mathcal{N}(0, \sigma^2), \qquad \sigma^{-2} \sim \mathcal{G}a(0.5, 0.2).$$

Study the convergence of the associated Gibbs sampler and the dependence on 
the starting value. 

12.20 Given a three-state Markov chain with transition matrix P, determine suffi- 
cient conditions on P for a derived two-state chain to be a Markov chain. 

12.21 Show that the sequence $(\xi^{(n)})$ defined in Section 12.4.2 is a Markov chain.
{Hint: Show that 

P{&^ = = J, . 0 

= E^(0) [U. + ^ 2 - D £ 4^^...] ^ 

and apply the strong Markov property.) 

12.22 (Tanner 1996) Show that, if 7T^ and if the stationary distribution is 

the posterior density associated with f{x\6) and Tr(^), the weight 

_ /(x|0W)7r(0«) 

7r‘(6K‘)) 



converges to the marginal m{x). 

12.23 Apply the convergence diagnostics of CODA to the models of Problems 10.34 
and 10.36. 







12.24 In the setup of Example 10.17, recall that the chain is uniformly er- 

godic, with lower bound 



r(P') 



j7+10a(^')7+10a-l 

r(10a + 7) 




Pi-f-a 



Implement a simulation method to determine the value of p and to propose an 
algorithm to simulate from this distribution. {Hint: Use the Riemann approx- 
imation method of Section 4.3 or one of the normalizing constant estimation 
techniques of Chen and Shao (1997); see Problem 4.1.) 



12.6 Notes 



12.6.1 Spectral Analysis 

As already mentioned in Hastings (1970), the chain $(\theta^{(t)})$ or a transformed chain
$(h(\theta^{(t)}))$ can be considered from a time series point of view (see Brockwell and Davis
1996, for an introduction). For instance, under an adequate parameterization, we
can model $(h(\theta^{(t)}))$ as an ARMA(p, q) process, estimating the parameters p and q, and
then use partially empirical convergence control methods. Geweke (1992) proposed
using the spectral density of $h(\theta^{(t)})$,

$$S_h(w) = \frac{1}{2\pi} \sum_{t=-\infty}^{\infty} \mathrm{cov}\left(h(\theta^{(0)}), h(\theta^{(t)})\right) e^{itw},$$

where $i$ denotes the complex square root of $-1$ (that is,
$e^{itw} = \cos(tw) + i \sin(tw)$).
The spectral density is related to the asymptotic variance of (12.1) since the limiting
variance $\gamma_h^2$ of Proposition 6.77 is given by

$$\gamma_h^2 = S_h(0).$$

Estimating $S_h$ by nonparametric methods like the kernel method (see Silverman
1986), Geweke (1992) takes the first $T_A$ observations and the last $T_B$ observations
from a sequence of length $T$ to derive

$$\delta_A = \frac{1}{T_A} \sum_{t=1}^{T_A} h(\theta^{(t)}), \qquad
\delta_B = \frac{1}{T_B} \sum_{t=T-T_B+1}^{T} h(\theta^{(t)}),$$

and the estimates $\sigma_A^2$ and $\sigma_B^2$ of $S_h(0)$ based on both subsamples, respectively.
Asymptotically (in $T$), the difference

(12.26)    $$\frac{\sqrt{T}\,(\delta_A - \delta_B)}{\sqrt{\dfrac{\sigma_A^2}{\tau_A} + \dfrac{\sigma_B^2}{\tau_B}}}$$

is a standard normal variable (with $T_A = \tau_A T$, $T_B = \tau_B T$, and $\tau_A + \tau_B < 1$). We can
therefore derive a convergence diagnostic from (12.26) and a determination of the
size $t_0$ of the training sample. The values suggested by Geweke (1992) are $\tau_A = 0.1$
and $\tau_B = 0.5$.
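The statistic (12.26) can be sketched as follows. Note that $\sqrt{T}(\delta_A - \delta_B)/\sqrt{\sigma_A^2/\tau_A + \sigma_B^2/\tau_B}$ equals $(\delta_A - \delta_B)/\sqrt{\sigma_A^2/T_A + \sigma_B^2/T_B}$, which is the form used in the code. The estimate of the variance at frequency zero relies here on a Bartlett-weighted sum of empirical autocovariances, which is an assumption of this sketch and only one of several possible nonparametric choices; the function names and the cutoff `max_lag` are likewise illustrative.

```python
import numpy as np


def long_run_variance(x, max_lag):
    """Bartlett-weighted sum of autocovariances (an assumed estimator)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    v = np.dot(xc, xc) / n
    for lag in range(1, max_lag + 1):
        weight = 1.0 - lag / (max_lag + 1)
        v += 2.0 * weight * np.dot(xc[:-lag], xc[lag:]) / n
    return v


def geweke_z(h, frac_a=0.1, frac_b=0.5, max_lag=50):
    """Standardized difference between early and late averages, as in (12.26)."""
    h = np.asarray(h, dtype=float)
    T = len(h)
    a = h[: int(frac_a * T)]              # first T_A observations
    b = h[T - int(frac_b * T):]           # last T_B observations
    va = long_run_variance(a, max_lag)
    vb = long_run_variance(b, max_lag)
    return (a.mean() - b.mean()) / np.sqrt(va / len(a) + vb / len(b))


rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
z_stationary = geweke_z(x)                # should look like a N(0, 1) draw

x_shifted = x.copy()
x_shifted[:1000] += 1.0                   # artificial initial transient
z_transient = geweke_z(x_shifted)         # should be far from 0
```

A large absolute value of the statistic signals that the early part of the sequence differs from the late part, that is, that the warm-up period is too short.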

A global criticism of this spectral approach also applies to all the methods using
a nonparametric intermediary step to estimate a parameter of the model, namely
that they necessarily induce losses in efficiency in the processing of the problem
(since they are based on a less constrained representation of the model). Moreover,
the calibration of nonparametric estimation methods (as the choice of the window in
the kernel method) is always delicate since it is not standardized. We therefore refer
to Geweke (1992) for a more detailed study of this method, which is used in some
software (see Best et al. 1995, for instance). Other approaches based on spectral
analysis are given in Heidelberger and Welch (1983) and Schruben et al. (1983) to
test the stationarity of the sequence by Kolmogorov-Smirnov tests (see Cowles and
Carlin 1996 and Brooks and Roberts 1999 for a discussion). Note that Heidelberger
and Welch (1983) test stationarity via a Kolmogorov-Smirnov test, based on

$$B_T(s) = \frac{S_{[Ts]} - [Ts]\,\bar\theta}{(T\psi(0))^{1/2}}, \qquad 0 \le s \le 1,$$

where

$$S_t = \sum_{s=1}^{t} \theta^{(s)}, \qquad \bar\theta = \frac{1}{T} \sum_{t=1}^{T} \theta^{(t)},$$

and $\psi(0)$ is an estimate of the spectral density. For large $T$, $B_T$ is approximately a
Brownian bridge and can be tested as such. Their method thus provides the
theoretical background to the CUSUM criterion of Yu and Mykland (1998) (see Section
12.3.1).

12.6.2 The CODA Software 

While the methods presented in this chapter are at various stages of their
development, some of the most common techniques have been aggregated in an S-Plus
or R software package called CODA, developed by Best et al. (1995). While originally
intended as an output processor for the BUGS software (see Section 10.6.2), this
software can also be used to analyze the output of Gibbs sampling and
Metropolis-Hastings algorithms. The techniques selected by Best et al. (1996) are mainly those
described in Cowles and Carlin (1995) (that is, the convergence diagnostics of
Gelman and Rubin (1992) (Section 12.3.4), Geweke (1992) (Section 12.6.1),
Heidelberger and Welch (1983), Raftery and Lewis (1992) (Section 12.4.1),
plus plots of autocorrelation for each variable and of cross-correlations between
variables). The MCMC output must, however, be presented in a very specific format to
be processed by CODA, unless it is produced by BUGS.
Perfect Sampling



There were enough broken slates underfoot to show that the process was 
unperfect. 

— Ian Rankin, Set in Darkness 



The previous chapters have dealt with methods that are quickly becoming
“mainstream.” That is, analyses using Monte Carlo methods in general, and
MCMC specifically, are now part of the applied statistician's tool kit.
However, these methods keep evolving, and new algorithms are constantly being
developed, with some of these algorithms resulting in procedures that seem
radically different from the current standards.

The final two chapters of this book cover methods that are, compared 
with the material in the first twelve chapters, still in their beginning stages of 
development. We feel, however, that these methods hold promise of evolving 
into mainstream methods, and that they will also become standard tools for 
the practicing statistician. 



13.1 Introduction 

MCMC methods have been presented and motivated in the previous chapters
as an alternative to direct simulation methods, like Accept-Reject algorithms,
because those were not adequate to process complex distributions.
The ultimate step in this evolution of simulation methods is to devise exact
(or “perfect”) sampling techniques based on these MCMC algorithms. While
we seem to be closing the loop on simulation methods by recovering exact
simulation techniques as in Chapter 2, we have nonetheless gained much from
the previous chapters in that, as we will see, we are now able to design much
more advanced Accept-Reject algorithms, that is, algorithms that could not
have been devised from scratch. This comes at a cost, though, in that these
novel algorithms are often quite greedy in both computing time and storage
space, but this is the price to pay for using a more generic construction in
the Accept-Reject algorithms. Before starting this chapter, we must stress
that the following methods, while generic in their principle, are still fairly
restricted in their statistical applications, with the main bulk of perfect sampling
implementations being found in spatial process problems (Note 13.6.1).

Chapter 12 details the difficult task of assessing the convergence of an
MCMC sampler, that is, the validity of the approximation $\theta^{(T)} \sim \pi(x)$,
and, correspondingly, the determination of a “long enough” simulation time.
Nonetheless, the tools presented in that chapter are mostly hints of convergence
(or of lack of convergence), and they very rarely give a crystal-clear signal
that $\theta^{(T)} \sim \pi(x)$ is valid for all simulation purposes. In fact, when
this happens, the corresponding MCMC algorithm ends up providing an exact
simulation from the (stationary) distribution $\pi(x)$ and thus fulfills the
purpose of the present chapter.

There is another connection between Chapters 12 and 13, in that the
methods developed below, even when they are too costly to be implemented
to deliver iid samples from $\pi$, can be used to evaluate the necessary computing
time or the mixing properties of the chain, that is, as put by Fill (1998a,b),
to determine “how long is long enough?”

Yet another connection between perfect sampling and convergence assessment
techniques is that, to reduce dependence on the starting value, some
authors (e.g., Gelman and Rubin 1992) have advocated running MCMC
algorithms in parallel. Perfect sampling is the extreme implementation of this
principle in that it considers all possible starting values simultaneously, and
runs the chains until the dependence on the starting value(s) has vanished.

Following Propp and Wilson (1996), several authors have proposed devices
to sample directly from the stationary distribution $\pi$ (that is, algorithms such
that $\theta^{(0)} \sim \pi$), at varying computational costs^1 and for specific
distributions and/or transitions. The name perfect simulation for such techniques was
coined by Kendall (1998), replacing the exact sampling terminology of Propp
and Wilson (1996) with a more triumphant description!

The main bulk of the work on perfect simulation deals, so far, with finite 
state spaces; this is due, for one thing, to the greater simplicity of these 
spaces and, for another, to statistical physics motivations related to the Ising 
model (see Example 10.2). The extension to mainstream statistical problems 
is much more delicate, but Murdoch and Green (1998) have shown that some 
standard examples in continuous settings, like the nuclear pump failure model 
of Example 10.17 (see Problem 12.24), do allow for perfect simulation. Note
also that in settings where the Duality Principle of Section 9.2 applies, the
stationarity of the finite chain obviously transfers to the dual chain, even if
the latter is continuous.

^1 In most cases, the computation time required to produce $\theta^{(0)}$ exceeds by
orders of magnitude the computation time of a $\theta^{(t)}$ from the transition kernel.


13.2 Coupling from the Past

As discussed earlier, a major drawback of using MCMC methods for simulating
a distribution $\pi$ is that their validation is only asymptotic. Formally,
we would need to run the transition kernel for an infinite number of iterations to
achieve simulation precisely from the stationary distribution $\pi$. This is
obviously impossible in practice, unless we can construct a stopping time $\tau$ on
the Markov chain such that the chain at that time has the same distribution
as one running for an infinite number of steps,

$\theta^{(\tau)} \sim \pi(\theta)$.

Propp and Wilson (1996) were the first to come up with such a stopping time. 



13.2.1 Random Mappings and Coupling 

In a finite state-space $\mathcal{X}$ of size $k$, the method proposed by Propp and
Wilson (1996) is called coupling from the past (CFTP). The principle is to start $k$
Markov chains, one in each of the $k$ possible states, far in the past at time
$-T_0$; if all these parallel chains take the same value (or coalesce) by a
fixed time 0, say, then the starting value is obviously irrelevant. Moreover, the
distribution of the chain at time 0 is the same as if we had started the iterations
at time $-\infty$, given that a chain starting at time $-\infty$ would necessarily^2
take one of the $k$ values at time $-T_0$.

To implement this principle requires a valid technique that ensures that
coalescence at time 0 can occur with a large enough probability. For instance,
if we run the above $k$ chains independently, the probability that they all take
the same value by time 0 will decrease as a power of $k$, whatever $T_0$. We
thus need to remove the independence assumption between the chains and use a
coupling scheme (see Section 6.6.1), which guarantees both that all chains are
marginally generated from the original transition kernel and that two chains
that once take the same value remain identical forever after. A convenient
representation for coupling proceeds through the notion of random mapping:
given that a Markov chain transition can generically be represented as

$\theta^{(t+1)} = \Psi(\theta^{(t)}, u_{t+1})$,

where the $u_t$'s are iid from a (fixed) distribution (Borovkov and Foss 1992, see
Problem 13.3), we can define the successive iterations of an MCMC algorithm
as a sequence of random mappings on $\mathcal{X}$,

$\Psi_1, \ldots, \Psi_t, \ldots,$

where $\Psi_t(\theta) = \Psi(\theta, u_t)$. In particular, we can write

(13.1) $\theta^{(t+1)} = \Psi_{t+1}(\theta^{(t)})$,

which is also called a stochastic recursive sequence. An important feature of
this sequence of random mappings is that, for a given $\theta^{(0)}$, the successive
images of $\theta^{(0)}$ by the compositions $\Psi_t \circ \cdots \circ \Psi_1$ are correctly
distributed from the $t$th transitions of the original Markov chain. Therefore,
for two different starting values, $\theta_1^{(0)}$ and $\theta_2^{(0)}$, the sequences

$\theta_1^{(t)} = \Psi_t \circ \cdots \circ \Psi_1(\theta_1^{(0)})$ and $\theta_2^{(t)} = \Psi_t \circ \cdots \circ \Psi_1(\theta_2^{(0)})$

are both marginally distributed from the correct Markov transitions and,
furthermore, since they are based on the same random mappings, the two
sequences will be identical from the time $t^*$ they take the same value, which is
the feature we were looking for.

^2 This is not an obvious result since the reverse forward scheme, where all chains
started at time 0 from all possible values in $\mathcal{X}$ are processed till they all
coincide, does not produce a simulation from $\pi$ (Problems 13.1 and 13.2).

Example 13.1. Coalescence. For $n \in \mathbb{N}$, $\alpha > 0$ and $\beta > 0$, consider the
joint distribution

$\theta \sim \mathcal{B}e(\alpha, \beta)$, $X | \theta \sim \mathcal{B}(n, \theta)$,

with joint density

$\pi(x, \theta) \propto \binom{n}{x} \theta^{x + \alpha - 1} (1 - \theta)^{n - x + \beta - 1}$.

Although we can simulate directly from this distribution, we can nonetheless
consider a Gibbs sampler adapted to this joint density with the following
transition rule at time $t$:

1. $\theta_{t+1} \sim \mathcal{B}e(\alpha + x_t, \beta + n - x_t)$,
2. $X_{t+1} \sim \mathcal{B}(n, \theta_{t+1})$,

with corresponding transition kernel

$K((x_t, \theta_t), (x_{t+1}, \theta_{t+1})) \propto \binom{n}{x_{t+1}} \theta_{t+1}^{x_{t+1} + \alpha + x_t - 1} (1 - \theta_{t+1})^{\beta + 2n - x_t - x_{t+1} - 1}$.

This is a special case of Data Augmentation (Section 9.2): the subchain $(X_t)$
is a Markov chain with $X_{t+1} | x_t \sim \mathrm{BetaBin}(n, \alpha + x_t, \beta + n - x_t)$, the
beta-binomial distribution.

Consider the following numerical illustration: $n = 2$, $\alpha = 2$ and $\beta = 4$. The
state space is thus $\mathcal{X} = \{0, 1, 2\}$ and the corresponding transition probabilities
for the subchain $(X_t)$ are

Pr(0 -> 0) = .583, Pr(0 -> 1) = .333, Pr(0 -> 2) = .083,

Pr(1 -> 0) = .417, Pr(1 -> 1) = .417, Pr(1 -> 2) = .167,

Pr(2 -> 0) = .278, Pr(2 -> 1) = .444, Pr(2 -> 2) = .278 .

Given a random draw $u_{t+1} \sim \mathcal{U}(0, 1)$, the paths from all possible starting
points are described by Figure 13.1. This graph shows, in particular, that
when $u_{t+1} > .917$, $u_{t+1} < .278$, or $u_{t+1} \in (.583, .722)$, the chain $X_t$ moves to
the same value, whatever its previous value. These ranges of $u_{t+1}$ thus ensure
coalescence in one step. ||



[Figure 13.1: seven panels showing the moves of the states 0, 1, 2, one panel
for each range of the uniform variate: $u_{t+1} < .278$, $u_{t+1} \in (.278, .417)$,
$u_{t+1} \in (.417, .583)$, $u_{t+1} \in (.583, .722)$, $u_{t+1} \in (.722, .833)$,
$u_{t+1} \in (.833, .917)$, $u_{t+1} > .917$.]

Fig. 13.1. All possible transitions for the Beta-Binomial(2,2,4) example, depending
on the range of the coupling uniform variate.
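
The transition probabilities of Example 13.1 and the one-step moves summarized in Figure 13.1 can be checked numerically. The following is only a sketch: the function names (`betabin_pmf`, `transition_matrix`, `psi`) are illustrative choices, and the coupling is assumed to be the inverse-cdf update driven by a single uniform, which is consistent with the ranges quoted in the example.

```python
from math import comb, exp, lgamma


def betabin_pmf(y, n, a, b):
    """P(Y = y) for Y ~ BetaBin(n, a, b)."""
    def log_beta(p, q):
        return lgamma(p) + lgamma(q) - lgamma(p + q)
    return comb(n, y) * exp(log_beta(a + y, b + n - y) - log_beta(a, b))


def transition_matrix(n, alpha, beta):
    """Row x holds the law of X_{t+1} given X_t = x."""
    return [[betabin_pmf(y, n, alpha + x, beta + n - x) for y in range(n + 1)]
            for x in range(n + 1)]


def psi(x, u, P):
    """Random mapping: inverse-cdf update of state x driven by the uniform u."""
    cum = 0.0
    for y, p in enumerate(P[x]):
        cum += p
        if u < cum:
            return y
    return len(P[x]) - 1  # guard against rounding when u is close to 1


P = transition_matrix(2, 2, 4)
for row in P:
    print([round(p, 3) for p in row])

# One uniform per panel of Figure 13.1: image of the states (0, 1, 2).
for u in (0.10, 0.35, 0.50, 0.65, 0.80, 0.90, 0.95):
    print(u, [psi(x, u, P) for x in range(3)])
```

With this coupling, the three chains share one image exactly when the uniform falls below .278, between .583 and .722, or above .917.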



As in Accept-Reject algorithms, there is a seemingly difficult point with
the random mapping representation

$\theta_t = \Psi(\theta_{t-1}, u_t)$

of an MCMC transition when $\theta_t$ is generated using a random number of
uniforms, because the vector $u_t$ has no predefined length. This difficulty is
easily overcome, though, by modifying $\Psi$ into a function such that $u_t$ is the seed
used in the pseudo-random generator at time $t$ (see Chapter 2). This eliminates
the pretense of simulating, for each $t$, an infinite sequence of uniforms.
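
As a minimal illustration of the seed device, the sketch below uses a hypothetical transition (an Accept-Reject draw from the density proportional to $y^\theta$ on (0, 1)) that consumes a random number of uniforms. The function name `psi` and the toy target are assumptions made for the example, not part of the text.

```python
import random


def psi(theta, seed):
    """Random mapping driven by a seed rather than by a vector of uniforms.

    Hypothetical transition: Accept-Reject draw from the density proportional
    to y**theta on (0, 1), which uses a random number of uniforms.
    """
    rng = random.Random(seed)          # the seed plays the role of u_t
    while True:
        y, v = rng.random(), rng.random()
        if v <= y ** theta:            # accept y with probability y**theta
            return y


seed_t = 2024                          # one seed per time step t
# Two chains in different states are updated with the same seed, hence coupled.
print(psi(1.0, seed_t), psi(3.0, seed_t))
```

Calling the mapping twice with the same seed and the same state returns the same value, which is what allows the maps to be stored and reused when going further into the past.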

Random mappings are central to the coupling from the past scheme of
Propp and Wilson (1996): if we find a time $T < 0$ such that the composition
$\Psi_0 \circ \cdots \circ \Psi_T$ is constant, the value of the chain at time 0 is
independent of the value of the chain at time $T$ and, by induction, of the value
of the chain at time $-\infty$. In other words, if we formally start a chain at time
$-\infty$, it will end up at time 0 with the same value no matter the starting value.
Before establishing more rigorously the validity of Propp and Wilson’s (1996)
scheme, we first state a fundamental theorem of coupling.

Theorem 13.2. For a finite state space $\mathcal{X} = \{1, \ldots, k\}$, consider $k$ coupled
Markov chains, $(\theta_t^{(1)}), \ldots, (\theta_t^{(k)})$, where
(i) $\theta_0^{(j)} = j$;
(ii) $\theta_{t+1}^{(j)} = \Psi(\theta_t^{(j)}, u_{t+1})$, where the $u_t$'s are mutually independent.

Then the time $\tau$ to coalescence is a random variable that depends only on
$u_1, u_2, \ldots$.

An interpretation of the coalescence time $\tau$ is that it is the time at which
the initial state of the chain has “worn off”: the distribution of $\theta_\tau^{(j)}$ is
obviously the same for all $j$'s. A natural subsequent question is whether or not
$\theta_\tau^{(1)}$ is a draw from the stationary distribution $\pi$. While it is the case
that, for a fixed time $\tau^* > \tau$, $\theta_{\tau^*}^{(1)}$ is distributed from the
stationary distribution, the fact that $\tau$ is a stopping time implies that
$\theta_\tau^{(1)}$ is usually not distributed from $\pi$, as
shown in Example 13.4 (see also Problems 13.1 and 13.2).
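
The bias of the forward scheme can be observed by simulation on the chain of Example 13.1. The sketch below assumes the inverse-cdf coupling of that example, with the transition probabilities written as fractions of 72; the names `CDF`, `psi` and `forward_coalescence`, the seed and the number of replications are arbitrary choices. The stationary distribution of $(X_t)$ is the BetaBin(2, 2, 4) marginal, that is, (10/21, 8/21, 3/21).

```python
import random
from collections import Counter

# Cumulative transition probabilities of Example 13.1 (n = 2, alpha = 2, beta = 4).
CDF = [[42 / 72, 66 / 72, 1.0],
       [30 / 72, 60 / 72, 1.0],
       [20 / 72, 52 / 72, 1.0]]


def psi(x, u):
    """Inverse-cdf update of state x driven by the uniform u in [0, 1)."""
    return next(y for y, c in enumerate(CDF[x]) if u < c)


def forward_coalescence(rng):
    """Run the coupled chains forward from time 0 and stop when they coalesce."""
    states = [0, 1, 2]
    while len(set(states)) > 1:
        u = rng.random()
        states = [psi(x, u) for x in states]
    return states[0]


rng = random.Random(1)
N = 20000
freq = Counter(forward_coalescence(rng) for _ in range(N))
print("forward coalescence:", [round(freq[x] / N, 3) for x in range(3)])
print("stationary         :", [round(p, 3) for p in (10 / 21, 8 / 21, 3 / 21)])
```

The value recorded at the forward coalescence time visits state 1 noticeably less often than the stationary probability 8/21 would require, which is the phenomenon described above.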

13.2.2 Propp and Wilson’s Algorithm 

The implementation of the above principle thus reduces to a search for a
constant composition $\Psi_0 \circ \cdots \circ \Psi_T$: once one such $T < 0$ is found,
coalescence occurs between time $T$ and time 0, and the common value of the chains
at time 0 is distributed from $\pi$. The algorithm, called coupling from the past,
thus reads as follows:

Algorithm A.54 -Coupling from the Past-

1. Generate random mappings $\Psi_0, \Psi_{-1}, \Psi_{-2}, \ldots$.

2. Define compositions ($t = -1, -2, \ldots$)

   $\Phi_t(\theta) = \Psi_0 \circ \Psi_{-1} \circ \cdots \circ \Psi_t(\theta)$.

3. Determine $T$ such that $\Phi_T$ is constant by looking successively at

   $\Phi_{-1}, \Phi_{-2}, \Phi_{-4}, \ldots$

4. For an arbitrary value $\theta_0$, take $\Phi_T(\theta_0)$ as a realization from $\pi$.



In Step 3 of this algorithm, the backward excursion stops^3 at a time
$T$ when $\Phi_T$ is a constant function; we are thus spared the task of running
the algorithm until $-\infty$! The same applies to the simulation of the random
mappings $\Psi_t$, which only need to be simulated till time $T$.

Note that, by virtue of the separate definitions of the maps $\Psi_t$, for a given
$\theta_T$, the value of $\theta^{(0)}$ is also given: not only does this save simulation
time but, more importantly, it ensures the validity of the method.
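
A minimal sketch of [A.54] on the three-state chain of Example 13.1 follows. It assumes the inverse-cdf coupling of that example and the doubling scheme $T = 1, 2, 4, \ldots$; the names `CDF`, `psi` and `cftp` are illustrative. The essential point is that the uniforms attached to times already visited are stored and reused, never redrawn.

```python
import random
from collections import Counter

# Cumulative transition probabilities of Example 13.1 (n = 2, alpha = 2, beta = 4).
CDF = [[42 / 72, 66 / 72, 1.0],
       [30 / 72, 60 / 72, 1.0],
       [20 / 72, 52 / 72, 1.0]]


def psi(x, u):
    """Inverse-cdf update of state x driven by the uniform u in [0, 1)."""
    return next(y for y, c in enumerate(CDF[x]) if u < c)


def cftp(rng, n_states=3):
    """Coupling from the past: returns the common value of the chains at time 0."""
    us = []        # us[j] drives the move from time -(j + 1) to time -j
    T = 1
    while True:
        while len(us) < T:
            us.append(rng.random())          # extend the past, keep old uniforms
        states = list(range(n_states))       # all starting values at time -T
        for j in range(T - 1, -1, -1):       # oldest move first, last move into time 0
            states = [psi(x, us[j]) for x in states]
        if len(set(states)) == 1:            # the composition is constant
            return states[0]
        T *= 2                               # go further back in time


rng = random.Random(42)
N = 20000
freq = Counter(cftp(rng) for _ in range(N))
print("CFTP      :", [round(freq[x] / N, 3) for x in range(3)])
print("stationary:", [round(p, 3) for p in (10 / 21, 8 / 21, 3 / 21)])
```

The empirical frequencies can be compared with the stationary probabilities (10/21, 8/21, 3/21) of the BetaBin(2, 2, 4) marginal; redrawing the stored uniforms at each extension would break this agreement.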

Theorem 13.3. The algorithm [A.54] produces a random variable distributed
exactly according to the stationary distribution of the Markov chain.

^3 Note that establishing that $\Phi_T$ is constant is formally equivalent to
considering all possible starting values at time $T$ and running the corresponding
coupled chains till time 0.

Proof. Following Casella et al. (2001), the proof is based on the fact that the
$k$ Markov chains do coalesce at some finite time into one chain, call it $\theta_t^*$.

First, as each chain $(\theta_t^{(j)})$ starting from state $j \in \mathcal{X}$ is irreducible,
and $\mathcal{X}$ is finite, there exists $N_j < \infty$ such that, for $N \ge N_j$,

$P(\theta_N^{(j)} = \theta \mid \theta_0^{(j)} = j) > 0$, for all $\theta \in \mathcal{X}$.

Then, each chain has a positive probability of being in any state at time
$N \ge \max\{N_1, N_2, \ldots, N_k\}$ and, for some $\epsilon > 0$,

$P(\theta_N^{(1)} = \theta_N^{(2)} = \cdots = \theta_N^{(k)}) > \epsilon$.

If we now consider the $N$th iterate kernel used backward, and set

$C_i = \{$The $k$ chains coalesce in $(-iN, -(i-1)N)\}$,

under the assumption that all chains are started at time $-iN$ for the
associated CFTP algorithm, we have that $P(C_i) > \epsilon$. Moreover, the $C_i$
are independent because coalescence in $(-iN, -(i-1)N)$ depends only on

$\Psi_{-iN+1}, \Psi_{-iN+2}, \ldots, \Psi_{-(i-1)N}$.

Finally, note that the probability that there is no coalescence after $I$
iterations can be bounded using Bonferroni’s inequality:

$P(\text{no coalescence after } I \text{ iterations}) \le \prod_{i=1}^{I} [1 - P(C_i)] < (1 - \epsilon)^I$.

Thus, this probability goes to 0 as $I$ goes to $\infty$, showing that the probability
of coalescence is 1 (Problem 13.3).

The result then follows from the fact that the CFTP algorithm starts
from all possible states. This implies that the realization of a Markov chain
$\theta_t$ starting at $-\infty$ will, at some time $-t$, couple with one of the CFTP chains
and thereafter be equal to $\theta_t^*$. Therefore, $\theta_0^*$ and $\theta_0$ have the same
distribution and, in particular, $\theta_0^* \sim \pi$. □

Example 13.4. (Continuation of Example 13.1). In the setting of the
beta-binomial model, using the transitions summarized in Figure 13.1, we
draw $U_0$. Suppose $U_0 \in (.833, .917)$. Then the random mapping $\Psi_0$ is given as
follows.

[Diagram, from t = -1 to t = 0: 0 -> 1, 1 -> 2, 2 -> 2.]

The chains have not coalesced, so we go to time $t = -2$ and draw $U_{-1}$, with,
for instance, $U_{-1} \in (.278, .417)$. The composition $\Psi_0 \circ \Psi_{-1}$ is then as follows.

[Diagram, from t = -2 to t = -1 to t = 0: 0 -> 0 -> 1, 1 -> 0 -> 1, 2 -> 1 -> 2.]

The function $\Psi_0 \circ \Psi_{-1}$ is still not constant, so we go to time $t = -3$. Suppose
now $U_{-2} \in (.278, .417)$. The composition $\Psi_0 \circ \Psi_{-1} \circ \Psi_{-2}$ is given by

[Diagram, from t = -3 to t = -2 to t = -1 to t = 0: 0 -> 0 -> 0 -> 1,
1 -> 0 -> 0 -> 1, 2 -> 1 -> 0 -> 1.]

All chains have thus coalesced into $\theta_0 = 1$. We accept $\theta_0$ as a draw from $\pi$.
Note that, even though the chains have coalesced at $t = -1$, we do not accept
$\theta_{-1} = 0$ as a draw from $\pi$. ||
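
The three backward steps of Example 13.4 can be replayed deterministically. In the sketch below, the values .85, .30 and .35 are hypothetical uniforms picked inside the ranges quoted in the example for $U_0$, $U_{-1}$ and $U_{-2}$, and the coupling is assumed to be the inverse-cdf update of Example 13.1.

```python
# Cumulative transition probabilities of Example 13.1 (n = 2, alpha = 2, beta = 4).
CDF = [[42 / 72, 66 / 72, 1.0],
       [30 / 72, 60 / 72, 1.0],
       [20 / 72, 52 / 72, 1.0]]


def psi(x, u):
    """Inverse-cdf update of state x driven by the uniform u in [0, 1)."""
    return next(y for y, c in enumerate(CDF[x]) if u < c)


def paths(us):
    """Trajectories of the three coupled chains started len(us) steps in the past.

    us[0] is U_0 (move from t = -1 to t = 0), us[1] is U_{-1}, and so on.
    """
    out = []
    for x in range(3):
        path = [x]
        for u in reversed(us):       # oldest mapping first, Psi_0 last
            path.append(psi(path[-1], u))
        out.append(path)
    return out


us = [0.85, 0.30, 0.35]              # hypothetical U_0, U_{-1}, U_{-2}
for T in (1, 2, 3):
    print("start at t = -%d:" % T, paths(us[:T]))
```

Starting three steps back, all trajectories pass through state 0 at $t = -1$ and end in state 1 at $t = 0$; only the value at time 0 is retained.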

Propp and Wilson (1996) suggested monitoring the maps only at times
$-1, -2, -4, \ldots$, until coalescence occurs (at time 0), as this updating scheme
has nearly optimal properties. Note that omitting some times when checking
for a constant $\Phi_T$ does not invalidate the method. It simply means that we
may go further backward than necessary.



13.2.3 Monotonicity and Envelopes 

So far, the applicability of perfect sampling in realistic statistical problems is
not obvious, given that the principle applies only to a finite state space with
a manageable number of points. However, an important remark of Propp
and Wilson (1996) (which has been exploited in many subsequent papers) is
that the spectrum of perfect sampling can be extended to settings where a
monotonicity structure is available, namely, when there exists a (stochastic)
ordering $\preceq$ on the state space and when the extremal states for this ordering
are known. Here, stochastic ordering is to be understood in terms of the
random mappings $\Psi_t$, namely, that, if $\theta_1 \preceq \theta_2$, then
$\Psi_t(\theta_1) \preceq \Psi_t(\theta_2)$ for all $t$'s.

The existence of monotonicity structures immensely simplifies the
implementation of the CFTP algorithm and correspondingly allows for extension to
much more complex problems. For instance, if there exists a (stochastically)
larger state denoted $\tilde{1}$ and a (stochastically) smaller state $\tilde{0}$, the CFTP
can be run with only two chains starting from $\tilde{1}$ and $\tilde{0}$: when those two
chains coincide at time 0, $\Phi_{-N}(\tilde{0}) = \Phi_{-N}(\tilde{1})$ say, the (virtual) chains
starting from all possible states in $\mathcal{X}$ have also coalesced. Therefore, the
intermediary chains do not require monitoring and this turns out to be a big
gain if $\mathcal{X}$ is large.

Example 13.5. (Continuation of Example 13.1). Recall that the
transition kernel on the (sub-)chain $X_t$ is

$K(x_t, x_{t+1}) \propto \binom{n}{x_{t+1}} \frac{\Gamma(\alpha + x_t + x_{t+1}) \Gamma(\beta + 2n - x_t - x_{t+1})}{\Gamma(\alpha + x_t) \Gamma(\beta + n - x_t)}$.

For $x_1 > x_2$,

$\frac{K(x_1, y)}{K(x_2, y)} = \frac{\Gamma(\alpha + x_1 + y) \Gamma(\beta + 2n - x_1 - y) \Gamma(\alpha + x_2) \Gamma(\beta + n - x_2)}{\Gamma(\alpha + x_2 + y) \Gamma(\beta + 2n - x_2 - y) \Gamma(\alpha + x_1) \Gamma(\beta + n - x_1)}$

$= \frac{\Gamma(\alpha + x_2) \Gamma(\beta + n - x_2)}{\Gamma(\alpha + x_1) \Gamma(\beta + n - x_1)} \cdot \frac{(\alpha + x_1 + y - 1) \cdots (\alpha + x_2 + y)}{(\beta + 2n - x_2 - y - 1) \cdots (\beta + 2n - x_1 - y)}$

is increasing in $y$. This monotone ratio property implies that the transition
is monotone for the usual ordering on $\mathbb{N}$ and therefore that $\tilde{0} = 0$ and that
$\tilde{1} = n$. It is thus sufficient to monitor the chains started from 0 and from $n$
to check for coalescence, for any value of $n$. This property can be observed in
Figure 13.1. ||
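
The monotonicity of Example 13.5 means that only the chains started from 0 and from $n$ need to be simulated. The sketch below is one possible implementation under the inverse-cdf coupling; the function names and the choice $n = 10$, $\alpha = 2$, $\beta = 4$ are illustrative assumptions.

```python
import random
from itertools import accumulate
from math import comb, exp, lgamma


def betabin_cdf_rows(n, alpha, beta):
    """Row x holds the cdf of X_{t+1} given X_t = x for the beta-binomial subchain."""
    def log_beta(p, q):
        return lgamma(p) + lgamma(q) - lgamma(p + q)
    rows = []
    for x in range(n + 1):
        a, b = alpha + x, beta + n - x
        pmf = [comb(n, y) * exp(log_beta(a + y, b + n - y) - log_beta(a, b))
               for y in range(n + 1)]
        rows.append(list(accumulate(pmf)))
    return rows


def psi(x, u, cdf):
    """Inverse-cdf update: nondecreasing in x for a fixed uniform u."""
    for y, c in enumerate(cdf[x]):
        if u < c:
            return y
    return len(cdf[x]) - 1  # guard against rounding when u is close to 1


def monotone_cftp(rng, cdf):
    """CFTP that only follows the chains started from the extreme states 0 and n."""
    n = len(cdf) - 1
    us, T = [], 1            # us[j] drives the move from time -(j + 1) to time -j
    while True:
        while len(us) < T:
            us.append(rng.random())
        lo, hi = 0, n
        for j in range(T - 1, -1, -1):
            lo, hi = psi(lo, us[j], cdf), psi(hi, us[j], cdf)
        if lo == hi:         # the extremes agree, so every chain in between does too
            return lo
        T *= 2


cdf = betabin_cdf_rows(10, 2, 4)
print([monotone_cftp(random.Random(seed), cdf) for seed in range(10)])
```

The cost of one backward sweep is then proportional to the number of time steps rather than to the number of time steps times the number of states.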



Example 13.6. Mixtures of distributions. Consider the simplest possible
mixture structure

(13.2) $p f_0(x) + (1 - p) f_1(x)$,

where both $f_0$ and $f_1$ are known, with a uniform prior on $0 < p < 1$. For
a given sample $x_1, \ldots, x_n$, the corresponding Gibbs sampler is based on the
following steps at iteration $t$.

1. Generate $n$ iid $\mathcal{U}(0, 1)$ random variables $u_1^{(t)}, \ldots, u_n^{(t)}$.

2. Derive the indicator variables $z_i^{(t)}$ as $z_i^{(t)} = 0$ iff

   $u_i^{(t)} \le \frac{p^{(t-1)} f_0(x_i)}{p^{(t-1)} f_0(x_i) + (1 - p^{(t-1)}) f_1(x_i)}$

   and compute

   $m^{(t)} = \sum_{i=1}^{n} z_i^{(t)}$.

3. Simulate $p^{(t)} \sim \mathcal{B}e(n + 1 - m^{(t)}, 1 + m^{(t)})$.

To design a coupling strategy and a corresponding CFTP algorithm, we first
notice that $m^{(t)}$ can only take $n + 1$ possible values, namely, $0, 1, \ldots, n$. We
are thus de facto in a finite setting.

If we recall that a $\mathcal{B}e(m + 1, n - m + 1)$ random variable can be represented
as the ratio

$p = \sum_{i=1}^{m+1} \omega_i \Big/ \sum_{i=1}^{n+2} \omega_i$

for $\omega_1, \ldots, \omega_{n+2}$ iid $\mathcal{E}xp(1)$, we notice in addition that, if we use for all
parallel chains a common vector $(\omega_1^{(t)}, \ldots, \omega_{n+2}^{(t)})$ of iid $\mathcal{E}xp(1)$ random
variables, $p^{(t)}$ is increasing with $n - m^{(t)}$, the number of observations allocated
to $f_0$. We have therefore exhibited a monotone structure
associated with the Gibbs transition kernel and we thus only need to monitor
two chains: one started with $m^{(-T)} = 0$ and the other with $m^{(-T)} = n$, and
check whether or not they coalesce by time $t = 0$. Figure 13.2 (left) illustrates
the sandwiching argument sketched above: if we start from all possible chains,
that is, from all possible values of $m^{(-T)}$, they have all coalesced by the time
the two “extreme” chains have coalesced. Figure 13.2 (right) provides an
additional (if unnecessary) check that the stationary distribution is indeed
the distribution of the chain at time $t = 0$. ||
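
The monotone coupling of Example 13.6 can be sketched as follows. This is an illustration under stated assumptions rather than the implementation behind Figure 13.2: the state followed is the number of observations allocated to $f_0$ (that is, $n - m$), the densities $f_0$ and $f_1$ are hypothetical normal densities, the data are simulated, and all function names are made up for the example. Once the two extreme chains agree at time 0, the weight $p$ is drawn from its beta conditional.

```python
import math
import random


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def gibbs_step(count0, data, f0, f1, us, omegas):
    """Coupled Gibbs move on the number of observations allocated to f0.

    The weight p ~ Be(count0 + 1, n - count0 + 1) is built from the common
    exponentials, then the allocations are refreshed with the common uniforms.
    """
    p = sum(omegas[:count0 + 1]) / sum(omegas)
    return sum(u <= p * f0(x) / (p * f0(x) + (1 - p) * f1(x))
               for x, u in zip(data, us))


def mixture_cftp(data, f0, f1, rng):
    """Monotone CFTP on the allocation count, followed by a draw of p."""
    n = len(data)
    noise, T = [], 1     # noise[j] drives the move from time -(j + 1) to time -j
    while True:
        while len(noise) < T:
            noise.append(([rng.random() for _ in range(n)],
                          [rng.expovariate(1.0) for _ in range(n + 2)]))
        lo, hi = 0, n                        # the two extreme starting counts
        for j in range(T - 1, -1, -1):
            us, omegas = noise[j]
            lo = gibbs_step(lo, data, f0, f1, us, omegas)
            hi = gibbs_step(hi, data, f0, f1, us, omegas)
        if lo == hi:                         # coalescence at time 0
            omegas = [rng.expovariate(1.0) for _ in range(n + 2)]
            return sum(omegas[:lo + 1]) / sum(omegas)
        T *= 2


def f0(x):
    return normal_pdf(x, 0.0, 1.0)           # hypothetical known component


def f1(x):
    return normal_pdf(x, 2.5, 1.0)           # hypothetical known component


data_rng = random.Random(0)
data = [data_rng.gauss(0.0, 1.0) if data_rng.random() < 0.3 else data_rng.gauss(2.5, 1.0)
        for _ in range(30)]
print([round(mixture_cftp(data, f0, f1, random.Random(seed)), 3) for seed in range(5)])
```

Each call returns one value of $p$ that, under the assumptions above, is meant to be an exact draw from the posterior distribution of the weight; repeated calls with independent generators give an iid sample.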



Fig. 13.2. (left) Simulation of n = 495 iid random variables from
$.33\,\mathcal{N}(3.2, 3.2) + .67\,\mathcal{N}(1.4, 1.4)$ and coalescence at t = -73. (right) Iid sample
from the posterior distribution produced by the CFTP algorithm and fit to the
corresponding posterior density. (Source: Hobert et al. 1999.)



As stressed by Kendall (1998) and Kendall and Møller (1999, 2000), this
monotonicity can actually be weakened into an envelope or sandwiching
argument: if one can find two sequences $(\underline{\theta}_t)$ and $(\overline{\theta}_t)$, generated from a
transition not necessarily associated with the target distribution $\pi$, such that,
for any starting value $\theta_0$,

(13.3) $\underline{\theta}_t \preceq \Phi_t(\theta_0) \preceq \overline{\theta}_t$,

the coalescence of the two bounding sequences $\underline{\theta}_t$ and $\overline{\theta}_t$ obviously implies
coalescence of the (virtual) chains starting from all possible states in $\mathcal{X}$. While
this strategy is generally more costly in the number of backward iterations,
it nonetheless remains valid: the value produced at time 0 is distributed from
$\pi$. Besides, as we will see below in Example 13.7, it may also save computing
time. Note also that the upper and lower bounding sequences $\overline{\theta}_t$ and
$\underline{\theta}_t$ need not necessarily evolve within $\mathcal{X}$ at every step. This is quite a
useful feature when working on complex constrained spaces.

Example 13.7. (Continuation of Example 13.6.) Consider the extension
of (13.2) to the 3-component case

(13.4) $p_1 f_1(x) + p_2 f_2(x) + p_3 f_3(x)$, $p_1 + p_2 + p_3 = 1$,

where $f_1, f_2, f_3$ are known densities. If we use a (flat) Dirichlet $\mathcal{D}(1, 1, 1)$
prior on the parameter $(p_1, p_2, p_3)$, an associated Gibbs sampler for a sample
$x_1, \ldots, x_n$ from (13.4) is as follows.

1. Generate $u_1, \ldots, u_n \sim \mathcal{U}(0, 1)$.

2. Take

   $n_1 = \sum_{i=1}^{n} \mathbb{I}\left( u_i \le \frac{p_1 f_1(x_i)}{p_1 f_1(x_i) + p_2 f_2(x_i) + p_3 f_3(x_i)} \right)$,

   $n_2 = \sum_{i=1}^{n} \mathbb{I}\left( \frac{p_1 f_1(x_i)}{p_1 f_1(x_i) + p_2 f_2(x_i) + p_3 f_3(x_i)} < u_i \le \frac{p_1 f_1(x_i) + p_2 f_2(x_i)}{p_1 f_1(x_i) + p_2 f_2(x_i) + p_3 f_3(x_i)} \right)$,

   and $n_3 = n - n_1 - n_2$.

3. Generate $(p_1, p_2, p_3) \sim \mathcal{D}(n_1 + 1, n_2 + 1, n_3 + 1)$.

Once again, since $(n_1, n_2)$ can take only a finite number of values, the
setting is finite and we can monitor all possible starting values of $(n_1, n_2)$ to
run our CFTP algorithm. However, there are $(n + 2)(n + 1)/2$ such values. The
computational cost is thus in $O(n^2)$, rather than $O(n)$ for the two component
case. And there is no obvious monotonicity structure, even when considering
the exponential representation of the Dirichlet distribution:

$\left( \frac{\sum_{i=1}^{n_1+1} \omega_{1i}}{\sum_{j=1}^{3} \sum_{i=1}^{n_j+1} \omega_{ji}}, \frac{\sum_{i=1}^{n_2+1} \omega_{2i}}{\sum_{j=1}^{3} \sum_{i=1}^{n_j+1} \omega_{ji}}, \frac{\sum_{i=1}^{n_3+1} \omega_{3i}}{\sum_{j=1}^{3} \sum_{i=1}^{n_j+1} \omega_{ji}} \right) \sim \mathcal{D}(n_1 + 1, n_2 + 1, n_3 + 1)$

with

$\omega_{11}, \ldots, \omega_{1(n+1)}, \omega_{21}, \ldots, \omega_{2(n+1)}, \omega_{31}, \ldots, \omega_{3(n+1)} \sim \mathcal{E}xp(1)$.

While looking only at the boundary cases $n_i = 0$ ($i = 1, 2, 3$) seemed to be
the natural generalization of the case with 2 components, Hobert et al. (1999)
exhibited occurrences where coalescence of the entire range of chains had not
occurred at coalescence of the boundary chains.

In this setting, Hobert et al. (1999) have, however, obtained an enveloping
result, namely that the image of the triangle of all possible starting values of
$(n_1, n_2)$,

$\mathcal{T} = \{(n_1, n_2);\ n_1 + n_2 \le n\}$,

by the corresponding Gibbs transition is contained in the lozenge^4

$\mathcal{L} = \{(n_1, n_2);\ \underline{n}_1 \le n_1 \le \overline{n}_1,\ n_2 \ge 0,\ \underline{n}_3 \le n - n_1 - n_2 \le \overline{n}_3\}$,

where

- $\underline{n}_1$ is the minimum of the $n_1$'s over the images of the left border of $\mathcal{T}$;

- $\overline{n}_3$ is the $n_3$ coordinate of the image of (0, 0);

- $\overline{n}_1$ is the $n_1$ coordinate of the image of $(n, 0)$;

- $\underline{n}_3$ is the minimum of the $n_3$'s over the images of the diagonal of $\mathcal{T}$.

The reason for this property is that, for a fixed $n_2$,

$\frac{p_2}{p_1} = \sum_{i=1}^{n_2+1} \omega_{2i} \Big/ \sum_{i=1}^{n_1+1} \omega_{1i}$ and $\frac{p_3}{p_1} = \sum_{i=1}^{n - n_1 - n_2 + 1} \omega_{3i} \Big/ \sum_{i=1}^{n_1+1} \omega_{1i}$

are both decreasing in $n_1$. As a consequence,

$m_1 = \sum_{i=1}^{n} \mathbb{I}\left( u_i \le \left[ 1 + \frac{p_2 f_2(x_i) + p_3 f_3(x_i)}{p_1 f_1(x_i)} \right]^{-1} \right)$

is monotone (increasing) in $n_1$.

Besides, this boxing property is preserved over iterations, in that the image
of $\mathcal{L}$ is contained in

$\mathcal{L}' = \{(m_1, m_2);\ \underline{m}_1 \le m_1 \le \overline{m}_1,\ m_2 \ge 0,\ \underline{m}_3 \le m_3 \le \overline{m}_3\}$,

where

- $\underline{m}_1$ is the minimum $n_1$ over the images of the left border $\{n_1 = \underline{n}_1\}$;

- $\overline{m}_1$ is the maximum $n_1$ over the images of the right border $\{n_1 = \overline{n}_1\}$;

- $\underline{m}_3$ is the minimum $n_3$ over the images of the upper border $\{n_3 = \underline{n}_3\}$;

- $\overline{m}_3$ is the maximum $n_3$ over the images of the lower border $\{n_3 = \overline{n}_3\}$.

Therefore, checking for coalescence simplifies into monitoring the successive
lozenges $\mathcal{L}$ until they reduce to a single point. Figure 13.3 represents a few
successive steps of this monitoring, with the original chains provided for
comparison. It illustrates the obvious point that coalescence will consume more
iterations than direct monitoring of the chain, while requiring only an $O(n)$
computational effort at each iteration, compared with the brute force $O(n^2)$
effort of the direct monitoring. ||



^4 A lozenge is a figure with four equal sides and two obtuse angles and, more
commonly, a diamond-shaped candy used to suppress a cough, a “cough-drop.”

Fig. 13.3. Nine successive iterations of the coupled Gibbs sampler for the chains
associated with all possible starting values $(n_1, n_2)$, along with the evolution of
the envelope $\mathcal{L}$, for n = 63 observations simulated from a
$.12\,\mathcal{N}(\ldots, 0.49) + .76\,\mathcal{N}(3.2, 0.25) + .12\,\mathcal{N}(2.5, 0.09)$ distribution.
(Source: Hobert et al., 1999.)



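Kendall (1998) also noticed that the advantages of monotonicity can be
extended to anti-monotone settings, that is, when

$\Psi(\theta_1, u) \preceq \Psi(\theta_2, u)$ when $\theta_2 \preceq \theta_1$.

Indeed, it is sufficient to define the lower and upper bounding sequences,
$(\underline{\theta}_t)$ and $(\overline{\theta}_t)$, as

$\underline{\theta}_{t+1} = \Psi(\overline{\theta}_t, u_{t+1})$ and $\overline{\theta}_{t+1} = \Psi(\underline{\theta}_t, u_{t+1})$,

since the lower bound $\underline{\theta}_t$ is transformed into an upper bound by the transform
$\Psi$ and vice versa.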

13.2.4 Continuous State Spaces

The above extension, while interesting, does not seem general enough to cover
the important case of continuous state spaces. While Section 13.2.5 will show
that there exists a (formally) generic approach to the problem, the current
section illustrates some of the earlier and partial attempts to overcome the
difficulty of dealing with continuous state spaces.

First, as seen in Examples 13.1 and 13.6, it is possible to devise perfect
sampling algorithms in data augmentation settings (see Section 9.2.1), where
one of the two chains has a finite support. In Example 13.1, once $X$ is produced
by the perfect sampling algorithm, the continuous variable $\theta$ can be generated
from the conditional $\mathcal{B}e(\alpha + x, \beta + n - x)$. (This is another application of the
duality principle given in Section 9.2.3.)

A more general approach to perfect sampling in the continuous case relies
on Kac’s representation of stationary distributions (Section 6.5.2)

$\pi(A) = \sum_{t=1}^{\infty} \Pr(N_t \in A) \Pr(T^* = t)$,

where the kernel satisfies a minorizing condition $K(x, y) \ge \epsilon \nu(y) \mathbb{I}_C(x)$,
$\tau_\alpha$ is the associated renewal stopping rule, $T^*$ is the tail renewal time

$\Pr(T^* = t) = \frac{\Pr_\alpha(\tau_\alpha \ge t)}{\mathbb{E}_\alpha(\tau_\alpha)}$,

and $N_t$ has the same distribution as $X_t$, given $X_1 \sim \nu(\cdot)$ and no regeneration
before time $t$.

This was first applied by Murdoch and Green (1998) to uniformly ergodic
chains (for example, Metropolis-Hastings algorithms with atoms, including
the independent case (see Section 12.2.3)). When the kernel is uniformly
ergodic, Doeblin’s condition holds (Section 6.59) and the kernel satisfies
$K(x, y) \ge \epsilon \nu(y)$ for all $x \in \mathcal{X}$. Kac’s representation then simplifies to

$\pi(x) = \sum_{i=0}^{+\infty} \epsilon (1 - \epsilon)^i (\tilde{K}^i \nu)(x)$,

where $\tilde{K}$ denotes the residual kernel

$\tilde{K}(x, y) = \frac{K(x, y) - \epsilon \nu(y)}{1 - \epsilon}$.

The application of this representation to perfect sampling is then obvious:
the backward excursion time $T$ is distributed from a geometric distribution
$\mathcal{G}eo(\epsilon)$ and a single chain needs to be simulated forward from the residual
kernel $T$ times. The algorithm thus looks as follows.

Algorithm A.55 -Kac’s Perfect Sampling-

1. Simulate $x_0 \sim \nu$, $\omega \sim \mathcal{G}eo(\epsilon)$.

2. Run the transition $x_{t+1} \sim \tilde{K}(x_t, y)$, $t = 0, \ldots, \omega - 1$,
   and take $x_\omega$ as a realization from $\pi$.

See Green and Murdoch (1999) for other extensions, including the
multigamma coupling. Robert and Robert (2004) explore the extension of the
uniformly ergodic case to the general case via Kac’s representation of stationary
distributions.
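
Algorithm [A.55] can be tried on the finite chain of Example 13.1, which is uniformly ergodic. In the sketch below, the minorizing constant is taken as the sum of the column minima of the transition matrix and $\nu$ as the normalized column minima; this particular choice, the geometric convention $\Pr(\omega = i) = \epsilon (1 - \epsilon)^i$, and all names are assumptions of the example. The partial sums of the series above are computed exactly and compared with the stationary probabilities (10/21, 8/21, 3/21).

```python
import random
from fractions import Fraction as F

# Transition matrix of the subchain (X_t) in Example 13.1, as fractions of 72.
K = [[F(42, 72), F(24, 72), F(6, 72)],
     [F(30, 72), F(30, 72), F(12, 72)],
     [F(20, 72), F(32, 72), F(20, 72)]]

col_min = [min(K[x][y] for x in range(3)) for y in range(3)]
eps = sum(col_min)                        # minorizing constant, K(x, y) >= eps * nu(y)
nu = [c / eps for c in col_min]
K_res = [[(K[x][y] - eps * nu[y]) / (1 - eps) for y in range(3)] for x in range(3)]


def series(n_terms):
    """Partial sum of eps * (1 - eps)^i * nu K_res^i, i = 0, ..., n_terms - 1."""
    total, row, weight = [F(0)] * 3, nu[:], eps
    for _ in range(n_terms):
        total = [t + weight * r for t, r in zip(total, row)]
        row = [sum(row[x] * K_res[x][y] for x in range(3)) for y in range(3)]
        weight *= 1 - eps
    return total


def draw(dist, rng):
    """Inverse-cdf draw from a discrete distribution on {0, 1, 2}."""
    u, cum = rng.random(), F(0)
    for y, p in enumerate(dist):
        cum += p
        if u < cum:
            return y
    return len(dist) - 1


def kac_perfect_sample(rng):
    x = draw(nu, rng)                     # x_0 ~ nu
    while rng.random() >= eps:            # geometric number of residual moves
        x = draw(K_res[x], rng)
    return x


print([float(s) for s in series(60)])
print([kac_perfect_sample(random.Random(seed)) for seed in range(10)])
```

Here $\epsilon = 50/72$, so that on average fewer than one residual move is needed per draw.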

There is, however, a fundamental difficulty in extending CFTP from finite
state spaces to general (countable or uncountable) spaces. If

$T^* = \inf\{t \in \mathbb{N} :\ \Phi_t(\omega) = \Phi_t(\omega') \text{ for all } \omega, \omega'\}$

denotes the stopping time associated with [A.54], Foss and Tweedie (1998)
established the following result.

Lemma 13.8. The stopping time $T^*$ is almost surely finite,

$\Pr(T^* < \infty) = 1$,

if and only if the Markov kernel is uniformly ergodic.

Proof. The sufficient part of the lemma is obvious: as shown above, the
stopping time then follows a geometric distribution $\mathcal{G}eo(\epsilon)$.

The necessary part is also straightforward: if $C_t$ denotes the event that
coalescence has occurred by time $t$, $\epsilon_t = \Pr(C_t)$, and $\nu_t$ is the probability
distribution of $\Phi_t(\omega)$ conditional on $C_t$, then Doeblin’s condition is satisfied:

$K^t(\omega, \cdot) \ge \epsilon_t \nu_t(\cdot)$.

And $\Pr(T^* < \infty) = 1$ implies there exists a $t < \infty$ such that $\epsilon_t > 0$. □

The message contained in this result is rather unsettling: CFTP is not
guaranteed to work for Markov kernels that are not uniformly ergodic. Thus,
for instance, it is useless to try to implement [A.54] in the case of a random
walk proposal on an unbounded state space. As noted in Foss and Tweedie
(1998), things are even worse: outside uniformly ergodic kernels, CFTP does
not work, because the backward coupling time is either a.s. finite or a.s.
infinite.

Lemma 13.9. If $T^*$ is the stopping time defined in [A.54] as

$T^* = \inf\{T;\ \Phi_T \text{ constant}\}$,

then

$\Pr(T^* = \infty) \in \{0, 1\}$.

The proof of Foss and Tweedie (1998) is based on the fact that the event
$\{T^* = \infty\}$ is invariant under the transition kernel and a “0-1” law applies.
That is, $\Pr(T^* = \infty)$ is either 0 or 1.

Corcoran and Tweedie (2002) point out that this does not imply that
backward coupling never works for chains that are not uniformly ergodic: if
bounding processes as in (13.3) are used, then coupling may occur in an almost
surely finite time outside uniformly ergodic setups, as exemplified in Kendall
(1998). Nonetheless, situations where uniform ergodicity does not hold are
more perilous in that an unmonitored CFTP may run forever.

13.2.5 Perfect Slice Sampling

The above attempts, and others in the literature, remain fragmentary in that
they either are based on specialized results or suppose some advanced
knowledge of the transition kernel (for example, a renewal set and its residual
kernel).^5 In this section, we present a much more general approach which, even
if it is not necessarily easily implementable, offers enough potential to be
considered as a generic perfect sampling technique.

Recall that we introduced the slice sampling method in Chapter 8 as a
general Gibbs sampling algorithm based on the use of auxiliary variables with
uniform distributions of the type

$\mathcal{U}([0, \pi_i(\theta)])$.

In the special case of the univariate slice sampler, also recall that there is a
single auxiliary variable, $u \sim \mathcal{U}([0, 1])$, and that the simulation of the parameter
of interest is from the distribution

$\mathcal{U}(\{\omega;\ \pi(\omega) \ge u \pi(\omega_0)\})$.

In this case the slice sampler is the random walk version of the fundamental
theorem of simulation, Theorem 2.15.

Mira et al. (2001) realized that the fundamental properties of the slice
sampler made it an ideal tool for perfect sampling. Indeed, besides being
independent of normalizing constants, the slice sampler induces a natural order
on the space of interest, defined by $\pi$, the target distribution:

(13.5) $\omega_1 \preceq \omega_2$ if and only if $\pi(\omega_1) \le \pi(\omega_2)$.

This is due to the fact that if $\pi(\omega_1) \le \pi(\omega_2)$, and if $u \sim \mathcal{U}([0, 1])$, the
associated sets satisfy:

(13.6) $A_2 = \{\omega;\ \pi(\omega) \ge u \pi(\omega_2)\} \subset A_1 = \{\omega;\ \pi(\omega) \ge u \pi(\omega_1)\}$.

Therefore, if one simulates first the image of $\omega_1$ as $\omega_1' \sim \mathcal{U}(A_1)$, this value
can be proposed as image of $\omega_2$ and accepted if $\omega_1' \in A_2$, rejected otherwise
(in which case the image $\omega_2'$ is generated from $\mathcal{U}(A_2)$). In
both cases, $\pi(\omega_1') \le \pi(\omega_2')$, and this establishes the following result:

Lemma 13.10. The slice sampling transition can be coupled in order to
respect the natural ordering (13.5).
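
The coupling behind Lemma 13.10 can be sketched for a simple univariate target. The unnormalized density $\exp(-x^2/2)$ restricted to $[-3, 3]$ is a hypothetical choice made so that every slice is an interval; the names `pi`, `slice_interval` and `coupled_slice_step` are likewise illustrative.

```python
import math
import random

B = 3.0  # hypothetical bounded support [-B, B]


def pi(x):
    """Unnormalized target: standard normal shape restricted to [-B, B]."""
    return math.exp(-0.5 * x * x)


def slice_interval(level):
    """The set {x in [-B, B] : pi(x) >= level}, an interval since pi is unimodal."""
    r = min(B, math.sqrt(max(0.0, -2.0 * math.log(level))))
    return -r, r


def coupled_slice_step(w1, w2, rng):
    """One coupled slice sampler move for two states with pi(w1) <= pi(w2)."""
    u = 1.0 - rng.random()                # common uniform in (0, 1]
    lo1, hi1 = slice_interval(u * pi(w1))
    lo2, hi2 = slice_interval(u * pi(w2))
    new1 = rng.uniform(lo1, hi1)          # image of w1, uniform on A_1
    if lo2 <= new1 <= hi2:
        new2 = new1                       # accepted as the image of w2 as well
    else:
        new2 = rng.uniform(lo2, hi2)      # rejected: fresh uniform draw on A_2
    return new1, new2


rng = random.Random(0)
w1, w2 = B, 0.0                           # minimal and maximal states for pi
for t in range(10):
    w1, w2 = coupled_slice_step(w1, w2, rng)
    print(t, round(w1, 3), round(w2, 3), w1 == w2)
```

Once the proposed value is accepted for both chains, they share the same state and remain identical thereafter.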

There is even more to the slice sampler. In fact, as noticed by Mira 
et al. (2001), slice samplers induce a natural discretization of continuous state 
spaces. Using the inclusion (13.6), it is indeed possible to reduce the number 
of chains considered in parallel from a continuum to a discrete and, we hope, 
finite set. To do this, we start from the minimal state $\tilde 0 = \arg\min \pi(\omega)$ 
and a uniform variate $u$, and generate the image of $\tilde 0$ by the slice sampler 
transition, 

$\omega_0 \sim \mathcal{U}(\{\omega;\ \pi(\omega) \ge u\pi(\tilde 0)\})$ . 

This is straightforward since $\{\omega;\ \pi(\omega) \ge u\pi(\tilde 0)\}$ is the entire 
parameter set. (Obviously, this requires the parameter set to be bounded but, as 
noticed earlier, a parameter set can always be compactified by reparameterization, 
as shown in Example 13.11 below.) Once $\omega_0$ is generated, it is acceptable not 
only as the image of $\tilde 0$ but also of all the starting values $\theta$ such that 

$\pi(\omega_0) \ge u\pi(\theta)$ . 

The set of starting values $\{\theta;\ \pi(\theta) \le \pi(\omega_0)/u\}$ is thus reduced 
to a single image by this coupling argument. Now, consider a value $\theta_1$ (which 
can be arbitrary, see Problem 13.10) such that $\pi(\theta_1) = \pi(\omega_0)/u$ and 
generate its image as 

$\omega_1 \sim \mathcal{U}(\{\omega;\ \pi(\omega) \ge u\pi(\theta_1)\})$ . 

Then $\pi(\omega_1) \ge \pi(\omega_0)$ and $\omega_1$ is a valid image for all $\theta$'s 
such that $\pi(\omega_0)/u \le \pi(\theta) \le \pi(\omega_1)/u$. We have thus reduced a 
second continuum of starting values to a single chain. This discretization of the 
starting values continues until the maximum of $\pi$, if it exists, is reached, that 
is, when $\pi(\omega_j)/u \ge \max \pi(\omega)$. If $\pi$ is unbounded, the sequence of 
starting values $\omega_j$ ($j \ge 0$) is infinite (but countable). 

^ Once again, we stress that we exclude from this study the fields of stochastic 
geometry and point processes, where the picture is quite different. See Møller 
(2001, 2003) and Møller and Waagepetersen (2003). 
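
The discretization of the starting values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy target `pi` on [0, 1], the rejection step in `uniform_on_slice`, and all function names are hypothetical choices made here and do not come from Mira et al. (2001).

```python
import random

def pi(w):
    """Unnormalized toy target on [0, 1] (an assumption, not from the text)."""
    return 0.1 + w * (1.0 - w)

PI_MIN = pi(0.0)   # value of pi at the minimal state
PI_MAX = pi(0.5)   # value of pi at the maximal state

def uniform_on_slice(level, rng):
    """Draw uniformly from {w in [0, 1]: pi(w) >= level} by rejection."""
    while True:
        w = rng.random()
        if pi(w) >= level:
            return w

def discretize_starting_values(u, rng):
    """Return the images omega_j and the cutoffs pi(omega_j)/u.

    Image omega_j serves every starting value theta whose pi(theta) lies
    between the previous cutoff and pi(omega_j)/u. Requires 0 < u <= 1.
    """
    images, cutoffs = [], []
    level = u * PI_MIN              # slice level for the minimal state
    while True:
        w = uniform_on_slice(level, rng)
        images.append(w)
        cutoffs.append(pi(w) / u)
        if cutoffs[-1] >= PI_MAX:   # the maximum of pi has been reached
            return images, cutoffs
        # Next starting value theta has pi(theta) = pi(w)/u, so its slice
        # level u * pi(theta) is simply pi(w).
        level = pi(w)
```

For a given `u`, the list of cutoffs is nondecreasing and finite here because the toy target is bounded, which mirrors the stopping rule $\pi(\omega_j)/u \ge \max \pi(\omega)$.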

Example 13.11. Truncated normal distributions While we saw, in Example 2.20, 
an Accept-Reject algorithm for the simulation of univariate normal 
distributions, the multivariate case is more difficult and both Geweke 
(1991) and Robert (1993) suggested using Gibbs sampling in higher dimensions. 
Consider, as in Philippe and Robert (2003), a multivariate $\mathcal{N}_p(\mu, \Sigma)$ 
distribution restricted to the positive quadrant 

$Q_+ = \{x \in \mathbb{R}^p;\ x_i > 0,\ i = 1, \ldots, p\} = (\mathbb{R}_+)^p$ , 

which is bounded by $(2\pi|\Sigma|)^{-1/2}$ at $\tilde 1 = \mu$. While $\mu$ may well be 
outside the domain $Q_+$, this is not relevant here (by the sandwiching argument). 

The problem, in this case, is with $\tilde 0$: the minimum of the truncated normal 
density is achieved at $\infty$ and it is obviously impossible to start simulation from 
infinity, since this would amount to simulating from a “uniform distribution” 
on $Q_+$. A solution is to use reparameterization: the normal distribution 
truncated to $Q_+$ can be transformed into a distribution on $[0,1]^p$ by the function 

$h(x) = \left( \dfrac{x_1}{1+x_1}, \ldots, \dfrac{x_p}{1+x_p} \right) = z$ , 

with the corresponding density 

$\pi(z|\mu, \Sigma) \propto \exp\left\{ -(h^{-1}(z)-\mu)^t \Sigma^{-1} (h^{-1}(z)-\mu)/2 \right\} \prod_{i=1}^p \dfrac{1}{(1-z_i)^2}$ . 

With this new parameterization, $\tilde 1$ has to be found, analytically or 
numerically (because of the additional Jacobian term in the density), while 
$\tilde 0 = (1, \ldots, 1)$. In the special case when $\mu = (-2.4, 1.8)$ and 

$\Sigma = \begin{pmatrix} 1.0 & -1.2 \\ -1.2 & 4.4 \end{pmatrix}$ , 

$\tilde 1 = (0, 1.8/2.8)$. See Philippe and Robert (2003) for the implementation 
details of perfect sampling in this setting. || 



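When $\pi$ is bounded and the maximum value $\tilde 1 = \arg\max \pi(\omega)$ is known 
(see Problem 13.14 for an extension), an associated perfect sampling algorithm 
thus reads as follows. 

Algorithm A.56 -Perfect Monotone Slice Sampler- 
At time $-T$ 

1. Start from $\tilde 0$. 

2. Generate $u_{-T} \sim \mathcal{U}([0,1])$. 

3. Generate $\omega_0^{(-T)}$ uniformly over the parameter set. 

4. Generate the images $\omega_0^{(t)}$ as $t = -T+1, \ldots, 0$ 

   $\omega_0^{(t)} \sim \mathcal{U}(\{\omega;\ \pi(\omega) \ge u_t\,\pi(\omega_0^{(t-1)})\})$ . 

5. Set $\omega_1^{(-T)} = \tilde 1$. 

6. For $t = -T+1, \ldots, 0$, 
   if $\omega_0^{(t)}$ satisfies 
   $\pi(\omega_0^{(t)}) \ge u_t\,\pi(\omega_1^{(t-1)})$ , 
   there is coalescence at time 0; 
   if not, generate $\omega_1^{(t)}$ as 

   $\omega_1^{(t)} \sim \mathcal{U}(\{\omega;\ \pi(\omega) \ge u_t\,\pi(\omega_1^{(t-1)})\})$ . 

7. If there is no coalescence at time 0, increment $T$. 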
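
A minimal Python sketch of a perfect monotone slice sampler follows. It is not the authors' implementation: the toy target `pi`, the per-time seeding through `random.Random(f"{seed}:{t}")`, the doubling of `T`, and the use of a shared stream of uniform proposals to couple the two chains are all assumptions made here for illustration. The shared stream realizes the coupling of Lemma 13.10: the lower chain takes the first proposal falling in its (larger) slice, and the upper chain takes the same point whenever it also lies in the upper slice.

```python
import random

def pi(w):
    """Unnormalized toy target on [0, 1] (an assumption, not from the text)."""
    return 0.1 + w * (1.0 - w)

W_MIN, W_MAX = 0.0, 0.5   # minimal and maximal states for this toy target

def slice_update(pi_current, t, seed):
    """One slice-sampler transition driven only by the randomness of time t.

    The random stream depends on (seed, t) alone, so the same uniform u_t and
    the same proposals are reused for every chain and every restart.
    """
    rng = random.Random(f"{seed}:{t}")
    u = rng.random()
    while True:
        v = rng.random()
        if pi(v) >= u * pi_current:   # first proposal inside the slice
            return v

def perfect_slice_sample(seed):
    """Coupling from the past with a lower and an upper chain."""
    T = 1
    while True:
        lower, upper = W_MIN, W_MAX
        for t in range(-T + 1, 1):
            lower = slice_update(pi(lower), t, seed)
            upper = slice_update(pi(upper), t, seed)
        if lower == upper:            # coalescence at time 0
            return lower
        T *= 2                        # go further back, reusing old randomness
```

Calling `perfect_slice_sample(seed)` with different seeds gives independent draws; doubling `T` is one common way to "increment T" and is a design choice of this sketch.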

Example 13.12. Exponential mixtures For an exponential mixture 

$p\,\mathcal{E}xp(\lambda_0) + (1-p)\,\mathcal{E}xp(\lambda_1)$ , 

where $0 < p < 1$ is known, a direct implementation of slice sampling is quite 
difficult, since simulation from 

$\mathcal{U}\left(\left\{ (\lambda_0, \lambda_1);\ \prod_{i=1}^n \left[ p\lambda_0 \exp(-x_i\lambda_0) + (1-p)\lambda_1 \exp(-x_i\lambda_1) \right] \ge \epsilon \right\}\right)$ 

is impossible. However, using the missing data structure of the mixture model 
(see Example 9.2) 

$z, \lambda_0, \lambda_1 \mid x \sim \pi(\lambda_0, \lambda_1) \prod_{i=1}^n p^{1-z_i}(1-p)^{z_i}\, \lambda_{z_i} \exp(-x_i \lambda_{z_i})$ , 

where $z_i \in \{0, 1\}$, it is feasible to come up with a much simpler version. The 
idea developed in Casella et al. (2002) is to integrate out the parameters, i.e., 
$(\lambda_0, \lambda_1)$, and use the slice sampler on the marginal posterior distribution of 
the missing data vector, $z$. This has the obvious advantage of re-expressing 
the problem within a finite state space. 

If the prior on $\lambda_i$ is a $\mathcal{G}a(\alpha_i, \beta_i)$ distribution, the marginal posterior is 
available in closed form, 

(13.7) $z \mid x \propto p^{n_0}(1-p)^{n_1}\, \dfrac{\Gamma(\alpha_0+n_0-1)\,\Gamma(\alpha_1+n_1-1)}{(\beta_0+s_0)^{\alpha_0+n_0}(\beta_1+s_1)^{\alpha_1+n_1}}$ , 

where ($i = 0, 1$) 

$n_i = \sum_{j=1}^n \mathbb{I}_{z_j = i}$ and $s_i = \sum_{j=1}^n \mathbb{I}_{z_j = i}\, x_j$ . 

It thus factors through the sufficient statistic $(n_0, s_0)$. 

In addition, the extreme values $\tilde 0$ and $\tilde 1$ are available. For every $n_0$, $s_0$ 
belongs to the interval $[\underline{s}_0(n_0), \overline{s}_0(n_0)]$, where 

$\underline{s}_0(n_0) = \sum_{j=1}^{n_0} x_{(j)}$ and $\overline{s}_0(n_0) = \sum_{j=1}^{n_0} x_{(n+1-j)}$ , 

in which the $x_{(i)}$'s denote the order statistics of the sample $x_1, \ldots, x_n$, with 
$x_{(1)} \le \cdots \le x_{(n)}$. The minimum of the marginal posterior is achieved at 

$s_0^*(n_0) = \dfrac{(n_0+\alpha_0)(\beta_1+S) - (n-n_0+\alpha_1)\beta_0}{n+\alpha_0+\alpha_1}$ , 

where $S$ is the sum of all observations, provided this value is in 
$[\underline{s}_0(n_0), \overline{s}_0(n_0)]$. The maximum is obviously attained at one of the two 
extremes. The points $\tilde 1$ and $\tilde 0$ are then obtained by $(n+1)$ comparisons over 
all $n_0$'s. Note that, even though $s_0^*(n_0)$ is not necessarily a possible value of 
$s_0$, the sandwiching argument introduced in Section 13.2.3 validates its use for 
studying coalescence. 

This scheme is only practical for small values of $n$ (say $n \le 40$), because of 
the large number of values of $s_0$, namely, $\binom{n}{n_0}$. Figure 13.4 gives an 
illustration of its behavior for $n = 40$. || 
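
The quantities entering the search for $\tilde 0$ and $\tilde 1$ can be computed as in the following sketch. The function names and the argument order are hypothetical; only the part of (13.7) that depends on $s_0$ for a fixed $n_0$ is evaluated, since the Gamma and $p$ factors are constant in $s_0$.

```python
import math

def s0_bounds(x, n0):
    """Smallest and largest possible s0 for a given n0 (sums of order statistics)."""
    xs = sorted(x)
    n = len(xs)
    return sum(xs[:n0]), sum(xs[n - n0:])

def s0_minimizer(n, n0, S, a0, b0, a1, b1):
    """Value of s0 minimizing the marginal posterior for fixed n0."""
    return ((n0 + a0) * (b1 + S) - (n - n0 + a1) * b0) / (n + a0 + a1)

def log_kernel_in_s0(s0, n, n0, S, a0, b0, a1, b1):
    """Log of the s0-dependent factor of (13.7), using s1 = S - s0."""
    return (-(a0 + n0) * math.log(b0 + s0)
            - (a1 + n - n0) * math.log(b1 + S - s0))
```

Because the log kernel is convex in $s_0$, its maximum over the interval returned by `s0_bounds` sits at one of the two endpoints, which is the comparison used to locate $\tilde 1$.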
Fig. 13.4. Coupling from the past on the distribution (13.7) for a fixed $n_0 = 35$, 
for $n = 40$ simulated observations, with the sequence of $s_0^{(t)}$'s for both the upper 
and the lower chains, and the range of the marginal posterior in the background. 
(Source: Casella et al. 2002.) 



13.2.6 Perfect Sampling via Automatic Coupling 

In addition to slice sampling, there is another approach to perfect sampling, 
which is interesting from a theoretical point of view, as it highlights the con- 
nection with Accept-Reject algorithms. It is based on Breyer and Roberts 
(2000a) automatic coupling. 

We have two chains $(\omega_0^{(t)})$ and $(\omega_1^{(t)})$, where the first chain, 
$(\omega_0^{(t)})$, is run following a standard Metropolis-Hastings scheme with proposal 
density $q(\omega'|\omega)$ and corresponding acceptance probability 

$\dfrac{\pi(\omega')\, q(\omega_0^{(t)}|\omega')}{\pi(\omega_0^{(t)})\, q(\omega'|\omega_0^{(t)})} \wedge 1$ . 

The transition for the second chain, $(\omega_1^{(t)})$, is based on the same proposed value 
$\omega'$, with the difference being that the proposal distribution is independent of 
the second chain and is associated with the acceptance probability 

$\dfrac{\pi(\omega')\, q(\omega_1^{(t)}|\omega_0^{(t)})}{\pi(\omega_1^{(t)})\, q(\omega'|\omega_0^{(t)})} \wedge 1$ . 

If $\omega'$ is not accepted by the second chain, it then stays the same, 
$\omega_1^{(t+1)} = \omega_1^{(t)}$. Note that, keeping in mind the perfect sampling 
perspective, we can use the same uniform variable $u_t$ for both transitions, thus 
ensuring that the chains coalesce once they both accept a proposed value $\omega'$. 

The implications of this rudimentary coupling strategy on perfect sampling 
are interesting. Consider the special case of an independent proposal with 
density $h$ for the first chain $(x_0^{(t)})$. Starting from $x_0^{(t)}$ with candidate 
$y_t$, the above acceptance probabilities simplify to 

$\rho_0^{(t)} = \dfrac{\pi(y_t)\, h(x_0^{(t)})}{\pi(x_0^{(t)})\, h(y_t)} \wedge 1$ and $\rho_1^{(t)} = \dfrac{\pi(y_t)\, h(x_1^{(t)})}{\pi(x_1^{(t)})\, h(y_t)} \wedge 1$ . 

We can then introduce the (total) ordering associated with $\pi/h$, that is, 

$\omega_0 \preceq \omega_1$ if $\dfrac{\pi(\omega_0)}{h(\omega_0)} \le \dfrac{\pi(\omega_1)}{h(\omega_1)}$ , 

and observe that the above coupling scheme preserves the ordering (Problem 
13.9). This property, namely that the coupling proposal preserves the ordering, 
was noted by Corcoran and Tweedie (2002). 

If we are in a setting where there exist a minimal and a maximal element, 
$\tilde 0$ and $\tilde 1$, for the order $\preceq$, we can then run a perfect sampler based on the 
two chains starting from $\tilde 0$ and $\tilde 1$. Note that, since the choice of $h$ is 
arbitrary, it can be oriented towards an easy derivation of $\tilde 0$ and $\tilde 1$. 
Nonetheless, this cannot be achieved in every setting, since the existence of 
$\tilde 1$ implies that the ratio $\pi/h$ is bounded, therefore that the corresponding 
MCMC chain is uniformly ergodic, as shown in Theorem 7.8. In the special case 
where the state space $\mathcal{X}$ is compact, $h$ can be chosen as the uniform 
distribution on $\mathcal{X}$ and the extremal elements $\tilde 0$ and $\tilde 1$ are then those 
induced by $\pi$. 
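
As an illustration of this coupling on a finite state space, the following Python sketch runs coupling from the past with an independent proposal. It is a sketch under assumptions made here: the state space is `0, ..., len(pi) - 1`, `pi` may be unnormalized, `h` is a vector of proposal weights, and the per-time seeding and doubling of `T` are implementation choices, not part of the original description.

```python
import random

def accepts(ratio_current, ratio_candidate, u):
    """Independent Metropolis-Hastings acceptance: u <= (pi/h)(y) / (pi/h)(x)."""
    return u * ratio_current <= ratio_candidate

def automatic_cftp(pi, h, seed):
    """Perfect sample from pi using two chains started at the extremes of pi/h."""
    states = list(range(len(pi)))
    ratio = [pi[x] / h[x] for x in states]
    bottom = min(states, key=lambda x: ratio[x])   # minimal element
    top = max(states, key=lambda x: ratio[x])      # maximal element
    T = 1
    while True:
        x0, x1 = bottom, top
        for t in range(-T + 1, 1):
            # Randomness at time t depends on (seed, t) only, so it is reused
            # when the starting time is pushed further into the past.
            rng = random.Random(f"{seed}:{t}")
            y = rng.choices(states, weights=h)[0]  # common proposal y_t
            u = rng.random()                       # common uniform u_t
            if accepts(ratio[x0], ratio[y], u):
                x0 = y
            if accepts(ratio[x1], ratio[y], u):
                x1 = y
        if x0 == x1:
            return x0
        T *= 2
```

The upper chain leaves its starting value only when `u * max(pi/h) <= pi(y)/h(y)`, so the first move of the upper chain is an ordinary Accept-Reject acceptance of a draw from `h`.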



Example 13.13. (Continuation of Example 13.12) This strategy applies 
in the case of the mixture model, since the pair $(n_0, s_0)$ evolves in a compact 
(finite) state space. As an illustration, we fix the value of $n_0$ and simulate 
(exactly) $s_0$ from the density proportional to 

(13.8) $\dfrac{\Gamma(\alpha_0+n_0)\,\Gamma(\alpha_1+n-n_0)}{(\beta_0+s_0)^{\alpha_0+n_0}(\beta_1+S-s_0)^{\alpha_1+n-n_0}}$ 

on the finite set of the possible values of 

$s_0 = \sum_{j=1}^{n_0} x_{i_j}$ . 

This means that, starting at time $-T$ from $\tilde 0$, we generate the chain 
$(\omega_0^{(t)})$ forward in time according to a regular Metropolis-Hastings scheme 
with uniform proposal on the set of possible values of $s_0$. The chain 
$(\omega_1^{(t)})$, starting from $\tilde 1$, is then coupled to the first chain by the above 
coupling scheme. If coalescence has not occurred by time zero, we increase $T$. 

Figure 13.5 illustrates the rudimentary nature of the coupling: for almost 
1,000 iterations, coupling is rejected by $(\omega_1^{(t)})$, which remains equal to the 
original value $\tilde 1$, while $(\omega_0^{(t)})$ moves with a good mixing rate. Once 
coupling has occurred, both chains coalesce. || 

Fig. 13.5. Sample path of both chains $(\omega_0^{(t)})$ and $(\omega_1^{(t)})$ for a successful 
CFTP in a two-component exponential mixture simulated sample ($n = 81$ and 
$n_0 = 59$); the time axis runs from $-2000$ to $0$. (Source: Casella et al. 2002.) 

The above example shows the fundamental structure of this automatic 
CFTP algorithm: it relies on a single chain, since the second chain is only 
used through its value at $\tilde 1$, that is, $\max \pi(\theta)/h(\theta)$. At the time the 
two chains merge, the simulated value from $h$ satisfies 

$\pi(\omega_0^{(t)})/h(\omega_0^{(t)}) \ge u_t \max\{\pi(\theta)/h(\theta)\}$ . 

It is therefore exactly distributed from $\pi$, since this is the usual Accept-Reject 
step.^ Therefore, in the case of independent proposals, the automatic CFTP 
is an Accept-Reject algorithm! See Corcoran and Tweedie (2002) for other 
and more interesting extensions. We now consider a second and much less 
obvious case of an Accept-Reject algorithm corresponding to a perfect sampling 
algorithm. 



13.3 Forward Coupling 

Despite the general convergence result that coalescence takes place over an 
almost surely finite number of backward iterations when the chain is uniformly 
ergodic (see Section 13.2.3), there are still problems with the CFTP algorithm, 
in that the running time is nonetheless potentially unbounded. Consider the 
practical case of a user who cannot wait for a simulation to come out and 
decides to abort runs longer than $T_0$, restarting a new CFTP each time. 
This abortion of long runs modifies the output of the algorithm and does not 
produce exact simulations from $\pi$ (Problem 13.16). To alleviate this difficulty, 
Fill (1998a) proposed an alternative algorithm which can be interrupted while 
preserving the central feature of CFTP. 

Fill’s algorithm is a forward-backward coupling algorithm in that it 
considers a fixed time horizon in the future, i.e., forward, and then runs the 
Markov chain backward to time 0 to see if this value is acceptable. It relies on 
the availability of the reverse kernel $\tilde K$, namely, 

(13.9) $\tilde K(\omega, \omega') = \dfrac{\pi(\omega')\, K(\omega', \omega)}{\pi(\omega)}$ , 

which corresponds to running the associated Markov chain backward in time.^ 
It also requires simulation of the variable $U$ in the random mapping $\Psi$ 
conditional on its image, that is, conditional on $\omega' = \Psi(\omega, u)$ for fixed 
$\omega$ and $\omega'$. 

Let $g_\Psi(\cdot|\omega, \omega')$ denote the conditional distribution of $U$. A first version of 
this algorithm for a finite state space is the following. 

® This is a very special case when the distribution at coalescence is the stationary 
distribution. 



Algorithm A.57 -Fill’s Perfect Sampling Algorithm- 

1. Choose a time $T$ and a state $x_T = z$. 

2. Generate $X_{T-1}|x_T,\ X_{T-2}|x_{T-1},\ \ldots,\ X_0|x_1$ from the reverse 
   kernel $\tilde K$. 

3. Generate $U_1 \sim g_\Psi(u|x_0, x_1),\ U_2 \sim g_\Psi(u|x_1, x_2),\ \ldots,\ 
   U_T \sim g_\Psi(u|x_{T-1}, x_T)$ and derive the random mappings 
   $\Psi_1 = \Psi(\cdot, U_1), \ldots, \Psi_T = \Psi(\cdot, U_T)$. 

4. Check whether the composition $\Psi_T \circ \cdots \circ \Psi_1$ is constant. 

5. If so, then accept $x_0$ as a draw from $\pi$. 

6. Otherwise begin again, possibly with new values of $T$ and $z$. 
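
The steps above can be sketched in Python for a finite chain. This is an illustration under assumptions chosen here, not a reference implementation: the random mapping is taken to be the inverse-cdf map of each row of the transition matrix `P`, so that the conditional distribution of $U$ given a transition is uniform on a sub-interval, and `T` and `z` are kept fixed across restarts.

```python
import random

def random_map(P, x, u):
    """Inverse-cdf random mapping Psi(x, u) for the transition matrix P."""
    cum = 0.0
    for y, p in enumerate(P[x]):
        cum += p
        if u < cum:
            return y
    return len(P[x]) - 1   # guard against rounding when u is close to 1

def conditional_uniform(P, x, y, rng):
    """Draw U given Psi(x, U) = y: uniform on the sub-interval mapped to y."""
    lower = 0.0
    for p in P[x][:y]:
        lower += p
    return lower + P[x][y] * rng.random()

def fill_sample(P, pi, T, z, rng):
    """Fill's algorithm with fixed horizon T and final state z."""
    n = len(P)
    # Reverse kernel (13.9); pi may be unnormalized since only ratios enter.
    K_rev = [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]
    while True:
        # Step 2: backward path x_T = z, x_{T-1}, ..., x_0.
        path = [z]
        for _ in range(T):
            path.append(rng.choices(range(n), weights=K_rev[path[-1]])[0])
        path.reverse()                 # now path[t] = x_t for t = 0, ..., T
        # Steps 3-4: conditional uniforms and composition of the mappings.
        images = list(range(n))
        for t in range(T):
            u = conditional_uniform(P, path[t], path[t + 1], rng)
            images = [random_map(P, x, u) for x in images]
        if len(set(images)) == 1:      # all starting values coalesced
            return path[0]             # Step 5: accept x_0
```

A restart simply redraws the backward path and the uniforms, so a user can interrupt the loop at any attempt without biasing the accepted values.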

Lemma 13.14. The algorithm [A.57] returns an exact simulation from the 
stationary distribution $\pi$. 

There are several ways to prove that [A.57] is correct. Problem 13.17 gives 
one proof, but perhaps the most intuitive approach is to establish that [A.57] 
is, in essence, a rejection algorithm. 



Proof. Let $C_T(z)$ denote the event that all chains have coalesced and are in 
state $z$ at time $T$. The algorithm [A.57] is a rejection algorithm in that it 
generates $X_0 = x$, then accepts $x$ as a draw from $\pi$ if $C_T(z)$ has occurred. 
The associated proposal distribution is the $T$-step transition density $\tilde K^T(z, \cdot)$. 
The validity of the algorithm is established if $x$ is accepted with probability 

$\dfrac{1}{M}\, \dfrac{\pi(x)}{\tilde K^T(z, x)}$ where $M \ge \sup_x \dfrac{\pi(x)}{\tilde K^T(z, x)}$ . 

From (13.9), $\pi(x)/\tilde K^T(z, x) = \pi(z)/K^T(x, z)$. In addition, 
$P[C_T(z)] \le K^T(x', z)$ for any $x'$, since the probability of coalescence is smaller 
than the probability that the chain reaches $z$ starting in $x'$. Hence 
$P[C_T(z)] \le \min_{x'} K^T(x', z)$ and we have the bound 

$\max_x \dfrac{\pi(x)}{\tilde K^T(z, x)} = \max_x \dfrac{\pi(z)}{K^T(x, z)} = \dfrac{\pi(z)}{\min_{x'} K^T(x', z)} \le \dfrac{\pi(z)}{P[C_T(z)]}$ . 

So set $M = \pi(z)/P[C_T(z)]$; to run a correct Accept-Reject algorithm, we should 
accept $X_0 = x$ with probability 

$\dfrac{1}{M}\, \dfrac{\pi(x)}{\tilde K^T(z, x)} = \dfrac{P[C_T(z)]}{\pi(z)}\, \dfrac{\pi(x)}{\tilde K^T(z, x)} = \dfrac{P[C_T(z)]}{\pi(z)}\, \dfrac{\pi(z)}{K^T(x, z)} = \dfrac{P[C_T(z)]}{K^T(x, z)}$ , 

where we have again used detailed balance. And, as detailed in Problem 13.17, 
we have that 

$\dfrac{P[C_T(z)]}{K^T(x, z)} = P[C_T(z) \mid x \to z]$ , 

where $\{x \to z\}$ denotes the event that $X_0 = x$ given that $X_T = z$. This is 
exactly the probability that [A.57] accepts the simulated $X_0 = x$. □ 

Finally, note that the Accept-Reject algorithm is more efficient if $M$ is 
smaller, so choosing $z$ to be the state that minimizes $\pi(z)/P[C_T(z)]$ is a good 
choice. This will be a difficult calculation but, in running the algorithm, these 
probabilities can be estimated. 

^ If detailed balance (Definition 6.45) holds, $\tilde K = K$. 

Example 13.15. (Continuation of Example 13.1) To illustrate the behavior 
of [A.57] for the Beta-binomial example, we choose $T = 3$ and $X_T = 2$. 
Since the chain is reversible, $\tilde K = K$ with probabilities given by Figure 13.1. 
Suppose that $(X_2, X_1, X_0) = (1, 0, 1)$. The following picture shows the 
corresponding transitions. 

[Diagram: backward path through the states 0, 1, 2 at times t = 0, 1, 2, 3.] 

The corresponding conditional distributions on the uniforms are thus 

$U_1 \sim \mathcal{U}(0, .417)$ , $U_2 \sim \mathcal{U}(.583, .917)$ , and $U_3 \sim \mathcal{U}(.833, 1)$ , 

according to Figure 13.1. Suppose, for instance, that $U_1 \in (.278, .417)$, 
$U_2 \in (.833, .917)$ and $U_3 > .917$, in connection with Figure 13.1. 

If we begin chains at all states, 0, 1 and 2, the trails of the different chains 
through time $t = 3$ are then as follows. 

[Diagram: trails of the chains started from states 0, 1, 2 at times t = 0, 1, 2, 3.] 

The three chains coalesce at time 3; so we accept $X_0 = 1$ as a draw from $\pi$. || 



The algorithm [A.57] depends on an arbitrary choice of $T$ and $x_T$. This is 
appealing for the reasons mentioned above, namely, that the very long runs 
can be avoided, but one must realize that the coalescence times for CFTP 
and [A.57] are necessarily of the same magnitude. With the same transition 
kernel, we cannot expect time gains using [A.57]. If anything, this algorithm 
should be slower, because of the higher potential for rejection of the entire 
sequence if $T$ is small and of the unnecessary runs if $T$ is too large. 

An alternative to [A.57] is to generate the $U_i$'s unconditionally, rather 
than conditional on $x_i = \Psi(x_{i-1}, U_i)$. (Typically, $U_i \sim \mathcal{U}(0, 1)$.) This requires 
a preliminary rejection step, namely that only sequences of $U_i$'s that are 
compatible with the path from $X_0 = x_0$ to $X_T = x$ are accepted. This does not 
change the distribution of the accepted $X_0$ and it avoids the derivation of the 
conditional distributions $g_\Psi$, but it is only practical as long as the conditional 
distributions are not severely constrained. This is less likely to happen if $T$ is 
large. 

The extensions developed for the CFTP scheme, namely monotonicity, 
sandwiching (Section 13.2.3), and the use of slice sampling (Section 13.2.5), 
obviously apply to [A.57] as well. For instance, if there exist minimum and 
maximum starting values, $\tilde 0$ and $\tilde 1$, the algorithm only needs to restart from 
$\tilde 0$ and $\tilde 1$, once $X_0$ has been generated. Note also that, in continuous settings, 
the conditional distributions $g_\Psi$ will often reduce to a Dirac mass. In conclusion, 
once the basic forward-backward (or rather backward-forward) structure of 
the algorithm is understood, and its Accept-Reject nature uncovered, there is 
little difference in implementation and performance between [A.54] and [A.57]. 



13.4 Perfect Sampling in Practice 

With one notable exception (see Note 13.6.1), perfect sampling has not yet 
become a standard tool in simulation experiments. It still excites the interest 
of theoreticians, maybe because of the paradoxical nature of the object, but 
the implementation of a perfect sampler in a given realistic problem is some- 
thing far from easy. To find an adequate discretization of the starting values 
often requires an advanced knowledge of the transition mechanism, and also 
a proper choice of transition kernel (because of the uniform ergodicity re- 
quirement). For potentially universal schemes like the slice sampler (Section 
13.2.5), the difficulty in implementing the univariate slice sampler reduces its 
appeal, and it is much more difficult to come up with monotone schemes when 
using the multivariate slice sampler, as illustrated by the mixture examples. 



13.5 Problems 



13.1 For the random walk on $\mathcal{X} = \{1, 2, \ldots, n\}$, show that forward coalescence 
necessarily occurs in one of the two points 1 and $n$, and deduce that forward 
coupling does not produce a simulation from the uniform distribution on $\mathcal{X}$. 

13.2 Consider a finite Markov chain with transition matrix 

$P = \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}$ . 

(a) Show that the Markov chain is ergodic with invariant measure (.6, .2, .2). 

(b) Show that if three chains $x_1^{(t)}$, $x_2^{(t)}$, and $x_3^{(t)}$ are coupled, that is, if the 
three chains start from the three different states and 

$T = \inf\{t \ge 1;\ x_1^{(t)} = x_2^{(t)} = x_3^{(t)}\}$ , 

then the distribution of $x_1^{(T)}$ is not the invariant measure but is rather the 
Dirac mass in the first state. 

13.3 (Foss and Tweedie, 1998) 

(a) Establish that the stochastic recursive sequence (13.1) is a valid represen- 
tation of any homogeneous Markov chain. 

(b) Prove Theorem 13.2. 

(c) In the proof of Theorem 13.3, we showed that coalescence occurs at some 
finite time. Use the Borel-Cantelli Lemma to obtain the stronger conclusion 
that the coalescence time is almost surely finite. 

13.4 (Møller 2001) For the random mapping representation (13.1), assume that 

(i) the $U_t$'s are iid; 

(ii) there exists a state $\varpi$ such that 

$\lim_{t\to\infty} P(\Psi_t \circ \cdots \circ \Psi_0(\varpi) \in F) = \pi(F)$ for all $F \subset \mathcal{X}$ ; 

(iii) there exists an almost surely finite stopping time $T$ such that 

$\Psi_0 \circ \cdots \circ \Psi_{-t}(\varpi) = \Psi_0 \circ \cdots \circ \Psi_{-T}(\varpi)$ for all $t \ge T$ . 

Show that 

$\Psi_0 \circ \cdots \circ \Psi_{-T}(\varpi) \sim \pi$ . 

(Hint: Use the monotone convergence theorem.) 

13.5 We assume there exists some partial stochastic ordering of the state space 
$\mathcal{X}$ of a chain $(x^{(t)})$, denoted $\preceq$. 

(a) Show that if $\tilde{\mathcal{X}}$ denotes the set of extremal points of $\mathcal{X}$ (that is, for every 
$y \in \mathcal{X}$, there exist $x_1, x_2 \in \tilde{\mathcal{X}}$ such that $x_1 \preceq y \preceq x_2$), the backward 
coupling algorithm needs to consider only the points of $\tilde{\mathcal{X}}$ at each step. 

(b) Study the improvement brought by the above modification in the case of 
the transition matrix 

$P = \begin{pmatrix} 0.34 & 0.22 & 0.44 \\ 0.28 & 0.26 & 0.46 \\ 0.17 & 0.31 & 0.52 \end{pmatrix}$ . 

13.6 Given a sandwiching pair and such that :< :< 6t for all o;’s and 

all t’s, show the funneling property 

for all t < m < n. 

13.7 Consider a transition kernel $K$ which satisfies Doeblin’s condition^ 

$K(x, y) \ge r(x)$ , $x, y \in \mathcal{X}$ , 

which is equivalent to uniform ergodicity (see, e.g., Theorem 6.59). As a perfect 
simulation technique, Murdoch and Green (1998) propose that the continuum 
of Markov chains coalesces into a single chain at each time with probability 

$\rho = \int r(x)\, dx$ . 

(a) Using a coupling argument, show that if for all chains the same uniform 
variable is used to decide whether the generation from $r(x)/\rho$ occurs, then 
at time 0 the resulting random variable is distributed from the stationary 
distribution $\pi$. 

(b) Deduce that the practical implementation of the method implies that a 
single chain is started at a random moment $-N$ in the past from the bounding 
probability $\rho^{-1} r(x)$ with $N \sim \mathcal{G}eo(\rho)$. Discuss why simulating from the 
residual kernel is not necessary. 

13.8 (Robert et al. 1999) Establish the validity of the boxing argument in Example 
13.7. 

13.9 (Casella et al. 2002) For the automatic coupling of Section 13.2.6, and an 
independent proposal with density $h$, show that the associated monotonicity 
under $\preceq$ is preserved by the associated transition when the same uniform 
variable $u_t$ is used to update both chains. 

(a) Show that $\rho_0^{(t)} \ge \rho_1^{(t)}$ when the chains are ordered. 

(b) Consider the three possible cases: 

(i) $\omega'$ is rejected for both chains; 

(ii) $\omega'$ is accepted for the first chain, but rejected by the second chain; 

(iii) $\omega'$ is accepted for both chains. 

and deduce that the order $\preceq$ is preserved. (Hint: For cases (i) and (iii), 
this is trivial. For case (ii), the fact that $\omega'$ is rejected for the second chain 
implies that $\rho_1^{(t)} < 1$, and thus that $\pi(\omega')/h(\omega') < \pi(\omega_1^{(t)})/h(\omega_1^{(t)})$.) 

13.10 Show that the choice of the point $\theta_1$ such that $\pi(\theta_1) = \pi(\omega_0)/u$ is 
immaterial for the validation of the monotone slice sampler. 

13.11 In the setting of Example 13.11, find the maximum 1 for the numerical values 

13.12 (Mira et al. 2001) Assume that $\pi(\theta)$ can be decomposed as 
$\pi(\theta) \propto \pi_1(\theta)\pi_2(\theta)$. Define the slice sampler associated with the 
subgraph of $\pi$ and determine the order under which this slice sampler is monotone. 

13.13 (Mira et al. 2001) For a probability density $\pi$, define 

$Q_\pi(u) = \mu(\{\theta;\ \pi(\theta) \ge u\})$ . 

(a) Show that $Q_\pi$ is a decreasing function. 

(b) Show that $Q_\pi$ characterizes the slice sampler. 

(c) Show that, if tti and 7T2 are such that (u)/Qtti (u) is decreasing in u, the 

chains associated with the corresponding slice samplers, and (^ 2 ^^), 

are stochastically ordered, in that 

Pr (7ri(<^) < =i)>Pr (7T2(0^*^) < = $) . 

(d) Examine if Qt,-^{u)/Qt,^{u) is decreasing in u when 7Ti/7r2 is bounded from 
above. 

13.14 In the setting of the perfect slice sampler (Section 13.2.5), show that the 
maximum $\tilde 1$ does not need to be known to start the upper chain from $\tilde 1$, but 
only an upper bound on $\max \pi$. 

13.15 Show that, if the slice sampler is applied to a bounded density $\pi$, it is 
uniformly ergodic. (Hint: Show that the transition kernel is 



K{u 






{(jJ')'>UTx{(jl>) 



du = 



n{uj)) 



A 1 . 



and deduce a minorizing measure if $\pi(\omega) \le M$.) 

What about the converse? 

13.16 A motivation for Fill’s algorithm [A.57] is that interrupting [A.54] if it runs 
backward further than a fixed $T_0$ is creating a bias in the resulting sample. Show 
that this is the case by considering a Markov chain with transition matrix 

$P = \begin{pmatrix} .1 & .4 & .3 & .2 \\ .0 & .5 & .5 & .0 \\ .2 & .0 & .8 & .0 \\ .1 & .1 & .0 & .8 \end{pmatrix}$ 

and a fixed backward horizon $T_0 = 10$. 

13.17 (Casella et al. 2001) Recall that $C_T(z)$ denotes the event that all chains have 
coalesced to state $z$ at time $T$. Since [A.57] delivers a value only if $C_T(z)$ occurs, 
we want to establish that $P[X_0 = x|C_T(z)] = \pi(x)$. 

(a) Show that 

$P[X_0 = x|C_T(z)] = \dfrac{P[z \to x]\, P[C_T(z)|x \to z]}{\sum_{x'} P[z \to x']\, P[C_T(z)|x' \to z]}$ . 

(b) Show that, for every $x'$, 

$P[C_T(z)|x' \to z] = \dfrac{P[C_T(z) \text{ and } x' \to z]}{P[x' \to z]} = \dfrac{P[C_T(z)]}{P[x' \to z]}$ . 

(Hint: Use the fact that coalescence entails that, for each starting point $x'$, 
the chain is in $z$ at time $T$.) 

(c) Note that $P[x' \to z] = K^T(x', z)$ and deduce that 

$P[X_0 = x|C_T(z)] = \dfrac{\tilde K^T(z, x)\, P[C_T(z)]/K^T(x, z)}{\sum_{x'} \tilde K^T(z, x')\, P[C_T(z)]/K^T(x', z)} = \dfrac{\tilde K^T(z, x)/K^T(x, z)}{\sum_{x'} \tilde K^T(z, x')/K^T(x', z)}$ . 

(d) Use the detailed balance condition $\pi(z)\tilde K^T(z, x) = \pi(x)K^T(x, z)$ to deduce 

$P[X_0 = x|C_T(z)] = \dfrac{\pi(x)/\pi(z)}{\sum_{x'} \pi(x')/\pi(z)} = \pi(x)$ . 

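13.18 Wilson (2000) developed a read-once algorithm that goes forward only and 
does not require storing earlier values. For $m \in \mathbb{N}$, we define 

$\mathfrak{M}_i(\omega) = \Psi_{im} \circ \cdots \circ \Psi_{(i-1)m+1}(\omega)$ , 

$\kappa_1, \kappa_2, \ldots$ as the successive indices $i$ for which $\mathfrak{M}_i$ is a constant function, 

$\Phi_1 = \mathfrak{M}_{\kappa_1 - 1} \circ \cdots \circ \mathfrak{M}_1$ , 

$\Phi_i = \mathfrak{M}_{\kappa_i - 1} \circ \cdots \circ \mathfrak{M}_{\kappa_{i-1}}$ $(i > 1)$ , 

and $\tau_i = \kappa_i - 1 - \kappa_{i-1}$. 

(a) Show that, if the $\kappa_i$'s are almost surely finite, and if $p$ is the probability 
that $\mathfrak{M}_1$ is constant, then $p > 0$, $\Phi_i(\omega) \sim \pi$ for all $\omega$'s and $i > 1$, and 
$\tau_i \sim \mathcal{G}eo(p)$. 

(b) Deduce validity of the read-once algorithm of Wilson (2000): for an arbitrary 
$\omega$, take $(\Phi_2(\omega), \ldots, \Phi_{n+1}(\omega))$ as an iid sample from $\pi$. 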



13.6 Notes 



13.6.1 History 

Maybe surprisingly for such a young topic (Propp and Wilson published their 
seminal paper in 1996!), perfect sampling already enjoys a wealth of reviews and 
introductory papers. Besides Casella et al. (2001) on which this chapter is partly based 
(in particular for the Beta-Binomial example), other surveys are given in Fismen 
(1998), Dimakos (2001), and in the large body of work of Møller and coauthors, 
including the book by Møller and Waagepetersen (2003). David Wilson also set up 
a website which catalogs papers, preprints, theses, and even the talks given about 
perfect sampling!® 

While our examples are statistical, there are other developments in the area of 
point processes and stochastic geometry, much from the work of Møller and Kendall. 
In particular, Kendall and Møller (2000) developed an alternative to the Propp and 
Wilson (1996) CFTP algorithm, called horizontal CFTP, which mainly applies to point 
processes and is based on continuous time birth-and-death processes, but whose 
description goes rather beyond the level of this book. See also Fernandez et al. (1999) 
for another horizontal CFTP algorithm for point processes. Berthelsen and Møller 
(2003) use these algorithms for nonparametric Bayesian inference on pairwise 
interaction point processes like the Strauss process. While the complexity of these models 
goes, again, beyond the level of this book, it is interesting to note that the authors 
exhibit an exponential increase of the mean coalescence time with the parameters 
of the Strauss process. 



® Its current address is http://dimacs.rutgers.edu/~dbwilson/exact.html/. 

13.6.2 Perfect Sampling and Tempering 

Tempering, as introduced in Marinari and Parisi (1992), Geyer and Thompson 
(1995), and Neal (1999), appears as an anti-simulated annealing, in that the 
distribution of interest $\pi$ appears as the “colder” or most concentrated distribution in 
a sequence $(\varpi_n)$ of target distributions, $\varpi_N = \pi$, with the “hotter” distributions 
$\varpi_1, \ldots, \varpi_{N-1}$ being less and less concentrated. The purpose behind this sequence is 
to facilitate the moves of a Metropolis-Hastings sampler by somehow flattening the 
surface of the target distribution. This device is, for instance, used in Celeux et al. 
(2000) to impose a proper level of label switching in the simulation of the posterior 
distribution associated with a mixture of distributions. 

Following the description of Møller and Nicholls (2004), a sequence of target 
distributions $(\varpi_n)_{1 \le n \le N^*}$ is thus chosen such that $\varpi_{N^*} = \pi$, $\varpi_1$ is a well-known 
distribution for which simulation is possible, and $\varpi_0$ is a distribution on a single 
atom, denoted 0. Each of the $\varpi_n$'s is defined on a possibly different space, $\Omega_n$. Using 
the auxiliary variable $N$, which takes values in $\{0, 1, \ldots, N^*\}$, and defining the joint 
distribution 

$\mu(\theta, n) = \pi_n \varpi_n(\theta)$ , $\sum_{n=0}^{N^*} \pi_n = 1$ , 

a reversible jump (Section 11.2) algorithm can be constructed for this joint 
distribution, for instance, with moves “up,” from $n$ to $n+1$, “down,” from $n$ to $n-1$, and 
“fixed-level,” with only $\theta$ changing. If $p_U$ and $p_D$ are the probabilities of moving up 
and down, respectively, with $1 - p_U - p_D$ the probability of choosing a fixed-level 
move, and if $q_{n \to n+1}(\theta'|\theta)$ and $q_{n+1 \to n}(\theta|\theta')$ denote the proposal distributions for 
upward and downward moves, we thus assume that the usual symmetry requirement 
holds for $q_{n \to n+1}$ and $q_{n+1 \to n}$, getting 

$\rho_{n \to n+1}(\theta, \theta') = \dfrac{\pi_{n+1}\, \varpi_{n+1}(\theta')\, p_D\, q_{n+1 \to n}(\theta|\theta')}{\pi_n\, \varpi_n(\theta)\, p_U\, q_{n \to n+1}(\theta'|\theta)} \wedge 1$ 

as the acceptance probability of an upward move. The symmetry assumption 
translates into 

$\rho_{n+1 \to n}(\theta', \theta) = \dfrac{\pi_n\, \varpi_n(\theta)\, p_U\, q_{n \to n+1}(\theta'|\theta)}{\pi_{n+1}\, \varpi_{n+1}(\theta')\, p_D\, q_{n+1 \to n}(\theta|\theta')} \wedge 1$ 

as the acceptance probability of a downward move. 

Example 13.16. Power tempering Assuming a compact space like $[0, 1]^d$, a 
possible implementation of the above scheme is to define a sequence of powers 
$\beta_1 = 0 < \ldots < \beta_{N^*} = 1$, with 

$\varpi_n \propto \pi^{\beta_n}$ . 

These distributions are literally flattened versions of the target distribution $\pi$. The 
corresponding acceptance probability for a move up from $n$ to $n+1$ is then 
proportional to 

$\dfrac{\pi_{n+1}\, p_D\, \pi^{\beta_{n+1}}(\theta')\, q_{n+1 \to n}(\theta|\theta')}{\pi_n\, p_U\, \pi^{\beta_n}(\theta)\, q_{n \to n+1}(\theta'|\theta)}$ , 

where the proportionality coefficient is the ratio of the normalizing constants of 
$\pi^{\beta_{n+1}}$ and $\pi^{\beta_n}$. Neal (1999) bypasses this difficulty by designing an equal number 
of moves from $n$ to $n+1$ and from $n+1$ to $n$, and by accepting the entire sequence 
as a single proposal, thus canceling the normalizing constants in the acceptance 
probability. || 

Since the augmented chain $(\theta_t, N_t)$ is associated with the joint distribution 
$\mu(\theta, n)$, the subchain $\theta_t$, conditional on $N_t = N^*$, is distributed from the target 
$\pi$. Therefore, monitoring the output of the MCMC sampler for only the times when 
$N_t = N^*$ produces a subchain converging to the target distribution $\pi$. This sounds 
like a costly device, since the simulations corresponding to the other values of $N_t$ 
are lost, but the potential gain is in an enhanced exploration of the state space 
$\Omega_{N^*}$. Note also that the other simulations can sometimes be recycled in importance 
sampling estimators. 
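
The idea of keeping only the iterations where $N_t = N^*$ can be illustrated with a small Python sketch on a finite space. This is a plain (not perfect) tempering sampler under simplifying assumptions made here: the tempered distributions are powers of `pi` normalized exactly, the level weights are equal, the single-atom level $\varpi_0$ is omitted, up and down moves keep $\theta$ fixed, and the fixed-level move uses a uniform proposal. None of these choices is taken from Møller and Nicholls (2004).

```python
import random

def simulated_tempering(pi, betas, n_iter, rng):
    """Run a tempering chain on (theta, n) and keep theta when n is the top level."""
    K, levels = len(pi), len(betas)
    # Normalized tempered distributions pi^beta (feasible on a finite space).
    temp = []
    for b in betas:
        w = [p ** b for p in pi]
        z = sum(w)
        temp.append([v / z for v in w])
    theta, n = 0, levels - 1
    kept = []
    for _ in range(n_iter):
        move = rng.random()
        if move < 1 / 3:
            m = n + 1          # upward move
        elif move < 2 / 3:
            m = n - 1          # downward move
        else:
            m = n              # fixed-level move
        if m == n:
            prop = rng.randrange(K)
            # Metropolis step within the current level (symmetric proposal).
            if rng.random() * temp[n][theta] <= temp[n][prop]:
                theta = prop
        elif 0 <= m < levels:
            # Level change with theta kept fixed; equal level weights cancel.
            if rng.random() * temp[n][theta] <= temp[m][theta]:
                n = m
        if n == levels - 1:
            kept.append(theta)   # only these values target pi
    return kept
```

The discarded iterations at lower levels are the cost mentioned above; in exchange the chain can cross low-probability regions while it sits at a flatter level.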

There are many applications of tempering in the MCMC literature (Geyer and 
Thompson 1992, Müller 1999, Celeux et al. 2000). What is of most interest in 
connection with this chapter is that it is fairly easy to define a dominating process on 
$\{0, 1, \ldots, N^*\}$, as shown by Møller and Nicholls (2004). To achieve this domination, 
they assume in addition that 



<'tn — , 

'^n 

that is, that the upward and downward transitions are bounded away from 0. The 
dominating process Dt is then associated with the same proposal as Nt and the 
acceptance probabilities 



Oin — >-n+l 

Oin—^n — 1 
OCri — 

We couple Nt and Dt so that (i) the upward, downward, and fixed-level moves are 
chosen with the same probabilities, and (ii) the acceptance is based on the same 
uniform. Then, 



= min ( 1, Kn 



TTn+l 



. 7Tn-l \ 

= mm 1, , 

\ iTTn+l / 



1 . 



Lemma 13.17. The chain $(D_t)$ dominates the chain $(N_t)$ in that 

$\Pr(N_{t+1} \le D_{t+1} \mid N_t = n, D_t = m) = 1$ when $n \le m$ . 

This result obviously holds by construction of the acceptance probabilities 
$\alpha_{n \to n+1}$ and $\alpha_{n \to n-1}$, since those favor upward moves against downward moves. 
Note that $(D_t)$ is necessarily ergodic, as a Markov chain on a finite state space, and 
reversible, with invariant distribution 



Due to this domination result, a perfect sampler is then

Algorithm A.58 -Tempered Perfect Sampling-

1. Go backward far enough to find $\tau^* < 0$ such that $D_{\tau^*} = N^*$ and
$D_t = 0$ for at least one $\tau^* < t \le 0$.

2. Simulate the coupled chain $(\theta_t, N_t)$ $\tau^*$ steps, with

3. Take $(\theta_0, N_0) \sim \mu$.



The validity of this algorithm follows from the monotonicity lemma above, in that
the (virtual) chains starting from all possible values in $\{0, 1, \ldots, N^*\}$ have coalesced
by the time the chain starting from the largest value $N^*$ takes the lowest value 0.
Note also that (0,0) is a particular renewal event for the chain 



Example 13.18. Flour beetle mortality. As described in Carlin and Louis
(2001), a logit model is constructed for a flour beetle mortality dataset, corresponding
to 8 levels of dosage $w_i$ of an insecticide, with $a_i$ exposed beetles and $y_i a_i$ killed
beetles. The target density is



r(/x, a, ly) oc 7r(yLt, i") H ^ 



expyi 






+ exp - 



with the prior density $\pi(\mu, \sigma, \nu)$ corresponding to a normal distribution on $\mu$, an
inverted gamma distribution on $\sigma^2$, and a gamma distribution on $\nu$.



[Figure: Flour Beetle Mortality Example]

Fig. 13.6. Forward simulation of the dominating process $D_t$ (above), along with
the tempering process $N_t$ (below). (Source: Møller and Nicholls 2004.)



The tempering sequence chosen in Møller and Nicholls (2004) is



8 

ZUrt(/i, CT, ly) oc 7t(/X, (J, v) 



exp yi 



CT 







1 + exp ^ 
with $\beta_{N^*} = 1$, $\beta_n < \beta_{n+1}$, and

$$\ell_i^* = y_i^{y_i} (1 - y_i)^{1 - y_i},$$

the maximum of the logit term. The move transitions $q_{n \to n+1}$ and $q_{n+1 \to n}$ are also
chosen to be Dirac masses, that is, $q_{n \to n+1}(\theta'|\theta) = \mathbb{I}_{\theta}(\theta')$. Therefore, as $p_u = p_d$,



Qn 



■n+l — 



7Tn+l 

TTn 



n 



1 + exp 



ail/ 



(4* 



/^n+l ~Pn 

< 

“ 7Tn 



and $\kappa_n = 1$. Note that, since all that matters are the probability ratios $\pi_{n+1}/\pi_n$,
the values of the normalizing constants of the $\pi_n$'s do not need to be known. (The
terms $\ell_i^*$ are also introduced to scale the various terms in the likelihood.)

In their numerical implementation, Møller and Nicholls (2004) chose $N^* = 3$,
$\beta_1 = 0$, $\beta_2 = .06$, and $\beta_3 = 0$, as well as $p_u = p_d = .333$. Figure 13.6 shows the
path of the dominating process $D_t$, along with the tempering process $N_t$. Note the
small number of times when $N_t = 3$, which are the times where $\pi(\mu, \sigma, \nu)$ is perfectly
sampled from, and also the fact that, by construction, $N_t \le D_t$. ||



Extensions and developments around these ideas can also be found in Brooks 
et al. (2002). 




14 



Iterated and Sequential Importance Sampling 



“The past is important, sir,” said Rebus, taking his leave. 
— Ian Rankin, The Black Book 



This chapter gives an introduction to sequential simulation methods, a collec- 
tion of algorithms that build both on MCMC methods and importance sam- 
pling, with importance sampling playing a key role. We will see the relevance 
of importance sampling and the limitations of standard MCMC methods in 
many settings, as we try to make the reader aware of important and ongoing 
developments in this area. In particular, we present an introduction to
Population Monte Carlo (Section 14.4), which extends these notions to a more
general case, and subsumes MCMC methods.



14.1 Introduction 

At this stage of the book, importance sampling, as presented in Chapters 3
and 4, may appear as a precursor of MCMC methods. This is because both
approaches use the ratio of target/importance distributions, $f/g$, to build
approximations to the target distribution $f$ and related integrals, and because
MCMC methods appear to provide a broader framework, in particular
because they can derive proposals from earlier generations. This chapter will
show that this vision is not quite true. Importance sampling can also be
implemented with dependent proposal¹ distributions and adaptive algorithms
that are much easier to build than in MCMC settings (Section 7.6.3), while
providing unbiased estimators of integrals of interest (if all normalizing
constants are known).

¹ Here, we use a vocabulary slightly different from other chapters, often found in
the particle filter literature: generations from importance distributions are often
called particles; importance distributions will be called proposal distributions.
Also, we restrict the use of the adjective sequential to settings where observations
(or targets) appear according to a certain time process, rather than all at once.
For other common uses of the word, we will prefer the denominations iterated or
repeated.

Importance sampling naturally enhances parallel sampling and there are 
settings, as in Example 9.2, where MCMC has a very hard time converging 
to the distribution of interest while importance sampling, based on identical 
proposals, manages to reach regions of interest for the target distribution 
(Section 14.4). 

Importance sampling is also paramount in sequential settings, where a con- 
stant modification of the target distribution occurs with high frequency and 
precludes the use of standard MCMC algorithms because of computational 
constraints. The books by Doucet et al. (2001) and Liu (2001) are dedicated 
to this topic of sequential sampling methods and we thus refer the reader 
there for a more complete treatment of this topic. 



14.2 Generalized Importance Sampling 

As already stated in the above paragraph, an incorrect impression that might 
be drawn from the previous chapters is that importance sampling is solely a 
forerunner of MCMC methods, and that the latter overshadows this technique. 
As shown below in Section 14.4, a deeper understanding of the structures of 
both importance sampling and MCMC methods can lead to superior hybrid 
techniques that are importance sampling at the core, that is, they rely on 
unbiasedness at any given iteration, but borrow strength from iterated meth- 
ods and MCMC proposals. We first see how iterated adaptive importance 
sampling is possible and why dependent steps can be embedded within im- 
portance sampling with no negative consequence on the importance sampling 
fundamental identity. 

A first basic remark about importance sampling is that its fundamental
unbiasedness property (14.1) is not necessarily jeopardized by dependences in
the sample. For instance, the following result was established by MacEachern
et al. (1999).

Lemma 14.1. If $f$ and $g$ are two densities, with $\mathrm{supp}(f) \subset \mathrm{supp}(g)$, and if
$\omega(x) = f(x)/g(x)$ is the associated importance weight, then for any kernel
$K(x,x')$ with stationary distribution $f$,

$$\int \omega(x)\, K(x,x')\, g(x)\, dx = f(x')\,.$$

This invariance result is an obvious corollary of the importance sampling
fundamental identity (3.9), which we repeat here:

$$E_f[h(X)] = \int h(x)\, \frac{f(x)}{g(x)}\, g(x)\, dx\,. \tag{14.1}$$

However, its consequences are quite interesting. Lemma 14.1 implies, in
particular, that MCMC transitions can be forced upon points of an importance
sample with no effect on the weights. Therefore, the modification of an
importance sample by MCMC transitions does not affect the unbiasedness
associated with the importance estimator (3.8) (even though it may modify its
variance structure), since

$$E[\omega(X)\, h(X')] = \iint \omega(x)\, h(x')\, K(x,x')\, g(x)\, dx\, dx' = E_f[h(X)]$$

for any $f$-integrable function $h$. One may wonder why we need to apply an
MCMC step in addition to importance sampling. The reason is that, since the
MCMC kernel is tuned to the target distribution, it may correct to some extent
a poor choice of importance function. This effect is not to be exaggerated,
though, given that the weights do not change. For example, a point $x$ with
a large (small) weight $\omega(x) = f(x)/g(x)$, when moved to $x'$ by the kernel
$K(x,x')$, does keep its large (small) weight, even if $x'$ is less (more) likely
than $x$ in terms of $f$. This drawback is linked to the fact that the kernel,
rather than the proposal within the kernel, is used in Lemma 14.1. Section
14.4 will discuss more adaptive schemes that bypass this difficulty.
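
The invariance stated in Lemma 14.1 can be checked numerically. The following is a minimal sketch, not taken from the text, under illustrative assumptions: the target $f$ is $N(0,1)$, the importance function $g$ is $N(0, 2^2)$, and the kernel $K$ is one random-walk Metropolis step with stationary distribution $f$. The function names are hypothetical. The weight is computed at $x$ before the move and is not recomputed at $x'$.

```python
import math
import random

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of the N(mean, sd^2) distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def mh_step(x, rng, scale=1.0):
    """One random-walk Metropolis step whose stationary distribution is f = N(0, 1)."""
    y = x + rng.gauss(0.0, scale)
    # Symmetric proposal: accept with probability min(1, f(y)/f(x)).
    if rng.random() < min(1.0, normal_pdf(y) / normal_pdf(x)):
        return y
    return x

def moved_importance_estimate(h, n, rng, g_sd=2.0):
    """Average of omega(x_i) * h(x'_i), with x_i ~ g and x'_i ~ K(x_i, .)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, g_sd)                      # x ~ g = N(0, g_sd^2)
        w = normal_pdf(x) / normal_pdf(x, 0.0, g_sd)  # omega(x) = f(x)/g(x), before the move
        x_moved = mh_step(x, rng)                     # x' ~ K(x, .)
        total += w * h(x_moved)                       # the weight is not recomputed at x'
    return total / n

rng = random.Random(0)
estimate = moved_importance_estimate(lambda x: x * x, 50_000, rng)
print(estimate)  # should be close to E_f[X^2] = 1
```

Because both normalizing constants are known in this toy setting, the plain average (without self-normalization) is used, in line with the unbiasedness property discussed above.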

Although the relevance of using MCMC steps will become clearer in
Section 14.3 for sequential settings, we point out at this stage that other
generalizations can be found in the literature, such as the dynamic weighting of Wong and
Liang (1997), where a sample of $(x_t, \omega_t)$'s is produced, with joint distribution
$g(x, \omega)$ such that

$$\int_0^\infty \omega\, g(x, \omega)\, d\omega \propto f(x)\,.$$

See Note 14.6.2 for more details on the possible implementations of this
approach.



14.3 Particle Systems 

14.3.1 Sequential Monte Carlo 

While this issue is rather peripheral to the purpose of the book, which focuses
on “mainstream” statistical models, there exist practical settings where a
sequence of target distributions, $(\pi_t)_t$, is available (to some extent like
normalizing or marginalizing) and needs to be approximated under severe time/storage
constraints. Such settings emerged in military and safety applications. For
instance, $\pi_t$ may be a posterior distribution on the position and the speed of
a plane at time $t$ given some noisy measures on these quantities. The
difficulty of the problem is the time constraint: we assume that this constraint
is too tight to hope to produce a reasonable Monte Carlo approximation
using standard tools, either independent (like regular importance sampling) or
dependent (like MCMC algorithms) sampling.

Example 14.2. Target tracking. The original example of Gordon et
al. (1993) is a tracking problem where an object (e.g., a particle, a
pedestrian, or a ship) is observed through some noisy measurement of its angular
position $Z_t$ at time $t$. Of interest are the position $(X_t, Y_t)$ of the object in the
plane and its speed $(\dot X_t, \dot Y_t)$ (using standard physics notation). The model is
then discretized as $\dot X_t = X_{t+1} - X_t$, $\dot Y_t = Y_{t+1} - Y_t$, and

$$\begin{aligned}
\dot X_t &= \dot X_{t-1} + \tau \epsilon_t^x \\
\dot Y_t &= \dot Y_{t-1} + \tau \epsilon_t^y \\
Z_t &= \arctan(Y_t / X_t) + \eta \epsilon_t^z\,,
\end{aligned} \tag{14.2}$$

where $\epsilon_t^x$, $\epsilon_t^y$ and $\epsilon_t^z$ are iid $\mathcal{N}(0, 1)$ random variables. Whatever the reason for
tracking this object, the distribution of interest is $\pi_t(\theta_t) = \pi(\theta_t | z_{1:t})$, where
$\theta_t = (\tau, \eta, \dot X_t, \dot Y_t)$ and $z_{1:t}$ denotes (in the signal processing literature) the
vector $\mathbf{z}_t = (z_1, \ldots, z_t)$. The prior distribution on this model includes the
speed propagation equation (14.2), as well as priors on $\tau$, $\eta$ and the initial
values $(x_0, y_0, \dot x_0, \dot y_0)$. Figure 14.1 shows a simulated sequence $(x_t, y_t)$ as well
as the corresponding $z_t$ in a representation inspired from the graph in Gilks
and Berzuini (2001). ||
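
To make the tracking model concrete, here is a minimal simulation sketch. It is an illustration and not the authors' code: the function name, the initial position and speed, and the seed are hypothetical choices, and the values $\tau = 0.2$ and $\eta = 0.05$ are borrowed from the caption of Figure 14.1. The position is advanced with the current speed, the speed then follows a Gaussian random walk, and only a noisy angle seen from the origin is recorded.

```python
import math
import random

def simulate_tracking(T, tau, eta, x0, y0, vx0, vy0, rng):
    """Simulate T positions and noisy angular observations of a moving target."""
    x, y, vx, vy = x0, y0, vx0, vy0
    positions, angles = [], []
    for _ in range(T):
        # Discretization: the new position is the old one plus the current speed.
        x += vx
        y += vy
        # Speed propagation: each speed component follows a Gaussian random walk.
        vx += tau * rng.gauss(0.0, 1.0)
        vy += tau * rng.gauss(0.0, 1.0)
        # Only the angle seen from the origin is observed, with noise of scale eta.
        z = math.atan(y / x) + eta * rng.gauss(0.0, 1.0)
        positions.append((x, y))
        angles.append(z)
    return positions, angles

rng = random.Random(0)
# Hypothetical starting point and speed, chosen only for illustration.
positions, angles = simulate_tracking(50, 0.2, 0.05, 1.0, 1.0, 0.1, 0.1, rng)
print(positions[-1], angles[-1])
```

The hidden quantities (positions and speeds) and the observed angles are returned separately, which mirrors the filtering problem: only `angles` would be available to the analyst.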




Fig. 14.1. Simulated sequence of target moves and observed angles, for $\tau = 0.2$
and $\eta = .05$.



Importance sampling seems ideally suited for this problem in that, if the
distributions $\pi_t$ and $\pi_{t+1}$ are defined in the same space, a sample distributed
(approximately) according to $\pi_t$ can be recycled by importance sampling to
produce a sample distributed (approximately) according to $\pi_{t+1}$. In the event
that the state spaces (i.e., the supports) of $\pi_t$ and $\pi_{t+1}$ are not of the same
dimension, it is often the case that the state space of $\pi_{t+1}$ is an augmentation
of the state space of $\pi_t$ and thus that only a subvector needs to be simulated to
produce an approximate simulation from $\pi_{t+1}$. This is, in essence, the idea of
sequential importance sampling as in, e.g., Hendry and Richard (1991) or Liu
and Chen (1995).

14.3.2 Hidden Markov Models 

A family of models where sequential importance sampling can be used is
the family of hidden Markov models [abbreviated to HMM]. They consist of
a bivariate process $(X_t, Z_t)_t$, where the subprocess $(Z_t)$ is a homogeneous
Markov chain on a state space $\mathcal{Z}$ and, conditional on $(Z_t)$, $(X_t)$ is a series
of random variables on $\mathcal{X}$ such that the conditional distribution of $X_t$ only
depends on $Z_t$, as represented in Figure 14.2. When $\mathcal{Z}$ is discrete, we have,
in particular,

$$Z_t | z_{t-1} \sim P(Z_t = i | z_{t-1} = j) = p_{ji}\,, \qquad X_t | z_t \sim f(x | \xi_{z_t})\,, \tag{14.3}$$

where the $\xi_i$'s denote the different values of the parameter. The process
$(Z_t)$, which is usually referred to as the regime or the state of the model, is not
observable (hence, hidden) and inference has to be carried out only in terms
of the observable process $(X_t)$. Numerous phenomena can be modeled this
way; see Note 14.6.3. For instance, Example 14.2 is a special case of an HMM
where, by a switch in notation, the observable is $Z_t$ and the hidden chain is




Fig. 14.2. Directed acyclic graph (DAG) representation of the dependence struc- 
ture of a hidden Markov model, where (Xt) is the observable process and (Zt) the 
hidden process. 



When $(Z_t)$ is a Markov chain on a discrete space, the hidden Markov model
is often called, somewhat illogically, a hidden Markov chain. In the following
example, as in Example 14.2, the support of $Z_t$ is continuous.

Example 14.3. Stochastic volatility. Stochastic volatility models are
quite popular in financial applications, especially in describing series with
sudden changes in the magnitude of variation of the observed values (see,
e.g., Jacquier et al. 1994, Kim et al. 1998). They use a latent linear process
$(Z_t)$, called the volatility, to model the variance of the observables $X_t$ in the
following way: Let $Z_0 \sim \mathcal{N}(0, \sigma^2)$ and, for $t = 1, \ldots, T$, define

$$\begin{cases}
Z_t = \varphi Z_{t-1} + \sigma \epsilon^*_{t-1}\,, \\
X_t = \beta e^{Z_t/2} \epsilon_t\,,
\end{cases} \tag{14.4}$$

where $\epsilon_t$ and $\epsilon^*_t$ are iid $\mathcal{N}(0, 1)$ random variables. Figure 14.3 shows a typical
stochastic volatility behavior for $\beta = 0$, $\sigma = 1$ and $\varphi = 0.9$. (For a similar
model see Note 9.7.2.) ||

Fig. 14.3. Simulated sample $(y_t)$ of a stochastic volatility process (14.4) with $\beta = 0$,
$\sigma = 1$ and $\varphi = 0.9$. (Source: Mengersen et al. 1999.)



This setting is also quite representative of many applications of particle
filters in state-space models,² in that the data $(x_t)$ is available sequentially,
and inference about $z_t$ or $\mathbf{z}_t = (z_1, \ldots, z_t)$ (and possibly fixed parameters as
those of (14.3)) is conducted with a different distribution $\pi_t(\cdot | x_1, \ldots, x_t) =
\pi_t(\cdot | \mathbf{x}_t)$ at each time $t$, like

$$\pi_t(\mathbf{z}_t | \mathbf{x}_t) \propto \prod_{\ell=2}^{t} p_{z_{\ell-1} z_\ell}\, f(x_\ell | \xi_{z_\ell}) \tag{14.5}$$

and

$$\pi_t(z_t | \mathbf{x}_t) = \sum_{\mathbf{z}_{t-1}} \pi_t(\mathbf{z}_{t-1}, z_t | \mathbf{x}_t) \tag{14.6}$$

in the discrete case. Note that the marginalization in (14.6) involves summing
up over all possible values of $\mathbf{z}_{t-1}$, that is, $p^{t-1}$ values when $\mathcal{Z}$ is of size $p$. A
direct computation of this sum thus seems intractable, but there exist devices
like the forward-backward or Baum-Welch formulas which compute this sum
with complexity $O(tp^2)$, as described in Problems 14.5-14.7.

² The term filtering is usually reserved for the conditional distribution of $Z_t$ given
$(X_1, \ldots, X_t)$, while smoothing is used for the conditional distributions of earlier
hidden states, like $Z_k$, $k < t$, given $(X_1, \ldots, X_t)$, and prediction is used for the
conditional distribution of future hidden states. In most applications, like signal
processing, the parameters of the model are of no interest to the analyst, whose
focus is on the reconstruction of the hidden Markov process. See, e.g., Cappé
et al. (2004).
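
The gain from a recursive computation can be illustrated on a small discrete hidden Markov model. The sketch below is an illustration only: it implements the forward (filtering) recursion, not the full forward-backward or Baum-Welch procedures, and it assumes an explicit initial distribution `init` for the first hidden state. The function names and the numerical values are hypothetical. A brute-force sum over all hidden paths is included only to check the recursion on a short sequence.

```python
from itertools import product

def forward_filter(init, trans, lik):
    """Filtering probabilities P(Z_t = j | x_1..x_t), t = 1..T, for a finite-state HMM.

    init[j]     : P(Z_1 = j)
    trans[i][j] : P(Z_t = j | Z_{t-1} = i)
    lik[t][j]   : likelihood of the t-th observation when the hidden state is j
    """
    p = len(init)
    alpha = [init[j] * lik[0][j] for j in range(p)]
    norm = sum(alpha)
    filtered = [[a / norm for a in alpha]]
    for t in range(1, len(lik)):
        prev = filtered[-1]
        # p^2 operations per time step, hence O(t p^2) overall.
        alpha = [lik[t][j] * sum(prev[i] * trans[i][j] for i in range(p))
                 for j in range(p)]
        norm = sum(alpha)
        filtered.append([a / norm for a in alpha])
    return filtered

def brute_force_filter(init, trans, lik):
    """Final-time filtering probabilities by summing over all p^T hidden paths."""
    p, T = len(init), len(lik)
    marg = [0.0] * p
    for path in product(range(p), repeat=T):
        w = init[path[0]] * lik[0][path[0]]
        for t in range(1, T):
            w *= trans[path[t - 1]][path[t]] * lik[t][path[t]]
        marg[path[-1]] += w
    total = sum(marg)
    return [m / total for m in marg]

# Hypothetical two-state example with four observations.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
lik = [[0.9, 0.2], [0.1, 0.5], [0.4, 0.4], [0.8, 0.3]]
fast = forward_filter(init, trans, lik)[-1]
slow = brute_force_filter(init, trans, lik)
print(fast, slow)
```

Both computations return the same final-time probabilities, but the brute-force version enumerates $p^T$ paths while the recursion only needs $p^2$ operations per observation.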

In the case of $\pi_t(\mathbf{z}_t | \mathbf{x}_t)$, when a sample of vectors $\mathbf{z}_t$ must be simulated
at time $t$, a natural proposal (i.e., importance) distribution is, as in Gordon
et al. (1993), to preserve the $(t-1)$ first components of the previous iteration,
that is, to keep $\mathbf{z}_{t-1}$ unchanged, to use the predictive distribution

$$\pi(z_t | \mathbf{z}_{t-1})$$

as an importance function for the last component $z_t$ of $\mathbf{z}_t$, and to produce as
the corresponding importance weight

$$\omega_t \propto \omega_{t-1}\, \frac{\pi_t(\mathbf{z}_t | \mathbf{x}_t)}{\pi_{t-1}(\mathbf{z}_{t-1} | \mathbf{x}_{t-1})\, \pi(z_t | \mathbf{z}_{t-1})} \propto \omega_{t-1}\, f(x_t | \xi_{z_t})\,. \tag{14.7}$$

Note that $\pi(z_t | \mathbf{z}_{t-1})$ is the conditional distribution of $z_t$ given the past of the
Markov chain, which collapses to being conditional only on $z_{t-1}$ because of
the Markov property. (We have omitted the replication index in the above
formulas for clarity’s sake: the reader will come to realize in the next section
that $n$ particles are simulated in parallel at each time $t$ with corresponding
weights $\omega_t^{(i)}$, $i = 1, \ldots, n$.)

Although we will use these models below for illustration purposes only, we
refer the reader to Cappé et al. (2004) for a detailed presentation of
asymptotics, inference and computation in hidden Markov models, and Doucet et al.
(2001) for additional perspectives on the computational difficulties and
solutions.



14.3.3 Weight Degeneracy 

There is a fundamental difficulty with a sequential application of the
importance sampling technique. The importance weight at iteration $t$ gets updated
by a recursive formula of the type $\omega_t^{(i)} \propto \omega_{t-1}^{(i)} \rho_t^{(i)}$, where $\rho_t^{(i)}$ is, for example,
an importance ratio for the additional component of the state space of $\pi_t$, like
$f_t(z_t^{(i)}) / g_t(z_t^{(i)})$. (See also the example of (14.7).) Therefore, in this case,

$$\omega_t^{(i)} \propto \exp\left\{ \sum_{\ell=1}^{t} \log\left( f_\ell(z_\ell^{(i)}) / g_\ell(z_\ell^{(i)}) \right) \right\}$$

and if we take the special occurrence where $g_t$ and $f_t$ are both independent
of $t$, using the Law of Large Numbers for the sum within the exponential, we
see that the right hand term above is approximately

$$\exp\left\{ -t\, E_g[\log\{ g(Z)/f(Z) \}] \right\}\,.$$

Since the Kullback-Leibler divergence $E_g[\log g(Z)/f(Z)]$ is positive (see Note
1.6.1), the weights thus have a tendency to degenerate to 0. This translates
into a degradation (or attrition) of the importance sample along iterations.
When $t$ increases the weights $\omega_t^{(i)}$ are all close to 0 except for one which is
close to 1, thus only one point of the sample contributes to the evaluation of
integrals, which implies that the method is useless in this extreme case.³
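
The attrition of the weights is easy to observe numerically. The sketch below is an illustration under assumed choices that are not in the text: $f = N(0,1)$ and $g = N(0, 2^2)$ at every step, 200 particles, and the effective sample size $1/\sum_i (\omega_t^{(i)})^2$ (computed on normalized weights) used as a summary of degeneracy. The function name is hypothetical.

```python
import math
import random

def ess_along_time(n, T, rng, g_sd=2.0):
    """Effective sample size of sequential importance weights after each of T steps.

    Each step adds one component z ~ g = N(0, g_sd^2) and multiplies the weight
    by f(z)/g(z), with f = N(0, 1); f and g do not depend on t.
    """
    log_w = [0.0] * n
    ess = []
    for _ in range(T):
        for i in range(n):
            z = rng.gauss(0.0, g_sd)
            # log f(z) - log g(z) for the two centred normal densities
            log_w[i] += math.log(g_sd) - 0.5 * z * z * (1.0 - 1.0 / g_sd ** 2)
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]      # stabilised exponentiation
        s = sum(w)
        ess.append(s * s / sum(v * v for v in w))   # 1 / sum of squared normalised weights
    return ess

ess = ess_along_time(200, 300, random.Random(0))
print(ess[0], ess[-1])
```

Working with log-weights and subtracting the maximum before exponentiating avoids numerical underflow, which would otherwise occur quickly since the unnormalized weights decay exponentially in $t$.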

Example 14.4. (Continuation of Example 14.2) Figure 14.4 illustrates
this progressive degradation of the importance sample for the model (14.2).
Assume that the importance function is, in that case, the predictive
distribution $\pi(\dot x_t, \dot y_t | \dot x_{t-1}, \dot y_{t-1})$. Then, if the distribution of interest is
$\pi_t(\dot x_t, \dot y_t | z_1, \ldots, z_t)$, the corresponding importance weight is then updated as

$$\omega_t^{(i)} \propto \omega_{t-1}^{(i)} \exp\left\{ -\left( z_t - \arctan(y_t^{(i)} / x_t^{(i)}) \right)^2 / 2\eta^2 \right\}\,,$$

a scheme open to the degeneracy phenomenon mentioned above. Two
estimators can be derived from these weights: the stepwise importance sampling
estimate that plots, for each time $t$, the value of $(x_t, y_t)$ that corresponds
to the highest importance weight $\omega_t$, and the final importance sampling
estimate that corresponds to the path $(x_t, y_t)_{1 \le t \le T}$ which is associated with the
highest weight $\omega_T$ at the end of the observation period $(0, T)$. (As shown in
Figure 14.4, the latter usually is smoother as it is more robust against
diverging paths.) As time increases, the importance sample deteriorates, compared
with the true path, as shown by both estimates, and it does not produce a
good approximation of the last part of the true target path, for the obvious
reason that, once a single beginning $(x_t, y_t)_{1 \le t \le T_0}$ has been chosen, there is
no possible correction in future simulations. ||



14.3.4 Particle Filters 

The solution to this difficulty brought up by Gordon et al. (1993) is called
a particle filter (or bootstrap filter). It avoids the degeneracy problem by
resampling at each iteration the points of the sample (or particles) according
to their weight (and then replacing the weights by one). This algorithm thus
appears as a sequential version of the SIR algorithm mentioned in Problem
3.18, where resampling is repeated at each iteration to produce a sample that
is an empirical version of $\pi_t$.

³ Things are even bleaker than they first appear. A weight close to 1 does not
necessarily correspond to a simulated value that is important for the distribution
of interest, but is more likely to correspond to a value with a relatively larger
weight, albeit very small in absolute value. This is a logical consequence of the
fact that the weights must sum to one.

Fig. 14.4. Simulated sequence of 50 target moves (full line), for $\tau = 0.05$ and $\eta =
.01$, along with the stepwise importance sampling (dashes) and the final importance
sampling estimates (dots) for importance samples of size 1000.



In the particular case $\mathbf{Z}_t \sim \pi_t(\mathbf{z}_t)$, where $\mathbf{Z}_t = (\mathbf{Z}_{t-1}, Z_t)$ (as in Section
14.3.2), with importance/proposal distribution $q_t(z | \mathbf{z}_{t-1})$ that simulates only
the last component of $\mathbf{Z}_t$ at time $t$, the algorithm of Gordon et al. (1993)
reads as follows.⁴

Algorithm A.59 -Bootstrap Filter-

At time $t$,

1 Generate

$$\tilde z_t^{(i)} \sim q_t(z | \mathbf{z}_{t-1}^{(i)})\,, \qquad i = 1, \ldots, n\,,$$

and set $\tilde{\mathbf{z}}_t^{(i)} = (\mathbf{z}_{t-1}^{(i)}, \tilde z_t^{(i)})$.

2 Compute the importance weight

$$\omega_t^{(i)} \propto \frac{\pi_t(\tilde{\mathbf{z}}_t^{(i)})}{q_t(\tilde z_t^{(i)} | \mathbf{z}_{t-1}^{(i)})\, \pi_{t-1}(\mathbf{z}_{t-1}^{(i)})}\,.$$

3 Resample, with replacement, $n$ particles $\mathbf{z}_t^{(i)}$ from the $\tilde{\mathbf{z}}_t^{(i)}$'s
according to the importance weights $\omega_t^{(i)}$ and set all weights to
$1/n$.

Therefore, there is no direct degeneracy phenomenon, since the weights
are set back to $1/n$ after each resampling step in [A.59]. The most relevant
particles are simply duplicated to the extent of their relevance and the less
relevant particles disappear from the sample. This algorithm thus provides at
each step an approximation to the distribution $\pi_t$ by the empirical distribution
(Problem 3.22)

$$\hat\pi_t(\mathbf{z}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\mathbf{z}_t^{(i)}}(\mathbf{z})\,. \tag{14.8}$$

Obviously, degeneracy may still occur if all the particles are simulated
according to a scheme that has little or no connection with the distribution $\pi_t$. In
that case, the “most relevant” particle may well be completely irrelevant after
a few steps.

⁴ In the original version of the algorithm, the sampling distribution of $\tilde z_t^{(i)}$ in Step
1 was restricted to the prior.
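
The three steps of the bootstrap filter can be sketched on a toy model. The following is a minimal illustration and not the tracking example of the text: it assumes a hypothetical linear Gaussian state-space model, $z_t = 0.9\, z_{t-1} + \epsilon_t$ and $x_t = z_t + \epsilon'_t$ with standard normal noises, and uses the prior transition as proposal, so that the weight reduces to the likelihood of the new observation. Only the current state is stored (not the whole path), and all names are hypothetical.

```python
import math
import random

def bootstrap_filter(obs, n, rng, phi=0.9, sd_state=1.0, sd_obs=1.0):
    """Bootstrap filter for the toy model z_t = phi z_{t-1} + noise, x_t = z_t + noise.

    Returns estimates of the filtering means E[z_t | x_1..x_t] based on n particles.
    """
    particles = [0.0] * n          # the hidden chain is started at 0, as in the data below
    means = []
    for x in obs:
        # Step 1: propagate each particle through the prior transition (the proposal).
        particles = [phi * z + rng.gauss(0.0, sd_state) for z in particles]
        # Step 2: with the prior as proposal, the weight reduces to the likelihood of x.
        weights = [math.exp(-0.5 * ((x - z) / sd_obs) ** 2) for z in particles]
        total = sum(weights)
        means.append(sum(w * z for w, z in zip(weights, particles)) / total)
        # Step 3: multinomial resampling; the weights are implicitly reset to 1/n.
        particles = rng.choices(particles, weights=weights, k=n)
    return means

rng = random.Random(1)
T = 400
z, states, obs = 0.0, [], []
for _ in range(T):
    z = 0.9 * z + rng.gauss(0.0, 1.0)
    states.append(z)
    obs.append(z + rng.gauss(0.0, 1.0))

means = bootstrap_filter(obs, 500, rng)
mse_filter = sum((m - s) ** 2 for m, s in zip(means, states)) / T
mse_obs = sum((x - s) ** 2 for x, s in zip(obs, states)) / T
print(mse_filter, mse_obs)
```

In this toy setting the filtering mean should track the hidden state more closely than the raw observations do, which is what the two mean squared errors compare. The estimate of the mean is computed from the weighted particles before resampling, since resampling only adds noise to that particular estimate.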

Note that the original algorithm of Gordon et al. (1993) was designed for
the simulation of the current $Z_t \sim \pi_t(z)$ rather than the simulation of the entire
path $\mathbf{Z}_t$. This will make a difference in the degeneracy phenomenon studied
in Section 14.3.6 and in possible improvements (as the auxiliary particle filter
of Pitt and Shephard 1999; see Problem 14.14). This is because the marginal
distribution $\pi_t(z_t)$ is usually much more complex than the joint distribution
$\pi_t(\mathbf{z}_t)$ (see Section 14.3.2), or even sometimes unavailable, as in semi-Markov
models (see Note 14.6.3).

Example 14.5. (Continuation of Example 14.2) If we use exactly the
same proposal as in Example 14.4, when the target distribution is the posterior
distribution of the entire path $(x_t, y_t)$, with the same simulated path as in
Figure 14.4, the reconstituted target path in Figure 14.5 is much closer to
the true path, both for the stepwise and final particle filter estimates, both of
which are based on averages (rather than maximal weights).⁵ ||



14.3.5 Sampling Strategies 

The resampling step of Algorithm [A.59], while useful in fighting
degeneracy, has a drawback. Picking the $\mathbf{z}_t^{(i)}$'s from a multinomial scheme introduces
unnecessary noise to the sampling algorithm, and this noise is far from
negligible since the variance of the multinomial distribution is of order $O(n)$. Several
proposals can be found in the literature for the reduction of this extra Monte
Carlo variation. One is the residual sampling of Liu and Chen (1995). Instead
of resampling $n$ points with replacement from the $\tilde{\mathbf{z}}_t^{(i)}$'s, this strategy reduces
the variance by first taking $\lfloor n\omega_t^{(i)} \rfloor$ copies of $\tilde{\mathbf{z}}_t^{(i)}$, where $\lfloor x \rfloor$ denotes the
integer part of $x$. We then sample the remaining $n - \sum_{i=1}^{n} \lfloor n\omega_t^{(i)} \rfloor$ particles from
the $\tilde{\mathbf{z}}_t^{(i)}$'s, with respective probabilities proportional to $n\omega_t^{(i)} - \lfloor n\omega_t^{(i)} \rfloor$. The

⁵ The visual fit may seem very poor judging from Figure 14.5, but one must realize
that the true path and the reconstituted path are quite close from the angle point
of view, since the observer is at (0, 0).








Fig. 14.5. Simulated sequence of 50 target moves, for $\tau = 0.05$ and $\eta = .01$, along
with the stepwise particle filter (dashes) and the final particle filter estimates (dots),
based on 1000 particles.



expected number of replications of $\tilde{\mathbf{z}}_t^{(i)}$ is still $n\omega_t^{(i)}$ and the sampling scheme
is unbiased (Problems 14.9-14.11).
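
Residual sampling can be written in a few lines. The sketch below is an illustration with hypothetical names: it returns the number of copies of each particle rather than the particles themselves, and it assumes the weights are already normalized to sum to one.

```python
import math
import random

def residual_resample(weights, rng):
    """Numbers of copies of each particle under residual sampling.

    weights must be normalised to sum to one; the counts add up to n = len(weights).
    """
    n = len(weights)
    # Deterministic part: floor(n w_i) copies of particle i.
    counts = [math.floor(n * w) for w in weights]
    remaining = n - sum(counts)
    if remaining > 0:
        # Random part: multinomial draws with probabilities proportional to
        # the fractional parts n w_i - floor(n w_i).
        residuals = [n * w - c for w, c in zip(weights, counts)]
        for i in rng.choices(range(n), weights=residuals, k=remaining):
            counts[i] += 1
    return counts

rng = random.Random(0)
counts = residual_resample([0.5, 0.25, 0.125, 0.125], rng)
print(counts)
```

With these four weights, $n\omega^{(i)}$ equals $(2, 1, 0.5, 0.5)$, so the first two particles are copied deterministically and only one copy is allocated at random between the last two.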

Crisan et al. (1999) propose a similar solution where they define
probabilities $\bar\omega_i \propto \omega_t^{(i)}$, with $N_{t-1}$ being the previous number of particles, and
they then take $N_t = \sum_{i=1}^{N_{t-1}} m_i$ with

$$m_i = \begin{cases}
\lfloor N_{t-1} \bar\omega_i \rfloor + 1 & \text{with probability } N_{t-1} \bar\omega_i - \lfloor N_{t-1} \bar\omega_i \rfloor\,, \\
\lfloor N_{t-1} \bar\omega_i \rfloor & \text{otherwise.}
\end{cases}$$

The drawback of this approach is that, although it is unbiased, it produces a
random number of particles at each step and $N_t$ thus may end up at either
infinity or zero. Following Whitley (1994), Kitagawa (1996) and Carpenter et al.
(1999) go even further in reducing the extra Monte Carlo variation through
systematic resampling. They produce a vector $(m_1, \ldots, m_n)$ of numbers of
replications of the particles $\tilde{\mathbf{z}}_t^{(i)}$ such that $E[m_i] = n\omega_t^{(i)}$ by computing the
vector $(\xi_1, \ldots, \xi_n)$ of the cumulative sums of $(n\omega_t^{(1)}, \ldots, n\omega_t^{(n)})$, generating a
single $u \sim \mathcal{U}([0,1])$ and then allocating the $m_i$'s as

$$\begin{aligned}
m_i &= \lfloor \xi_i + u \rfloor - \lfloor \xi_{i-1} + u \rfloor\,, \qquad i = 2, \ldots, n-1\,, \\
m_1 &= \lfloor \xi_1 + u \rfloor\,, \qquad m_n = n - \lfloor \xi_{n-1} + u \rfloor\,.
\end{aligned}$$

The amount of randomness is thus reduced to a single uniform variate. More
specific results about the reduction of the variability of the estimates can be
found in Chopin (2002) and Künsch (2004).
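
The systematic allocation translates directly into code. The following sketch uses hypothetical names and assumes normalized weights: it accumulates the cumulative sums of the $n\omega^{(i)}$'s, draws a single uniform variate, and takes differences of integer parts, with the last count set so that the total is exactly $n$.

```python
import math
import random

def systematic_counts(weights, rng):
    """Replication counts (m_1, ..., m_n) from the cumulative sums and one uniform draw."""
    n = len(weights)
    u = rng.random()                  # the only random variate used
    counts, cum, prev = [], 0.0, 0
    for i, w in enumerate(weights):
        cum += n * w                  # cumulative sum of the n w_j's up to particle i
        # The last count is set so that the total is exactly n.
        upper = n if i == n - 1 else math.floor(cum + u)
        counts.append(upper - prev)
        prev = upper
    return counts

rng = random.Random(0)
weights = [0.5, 0.25, 0.25]
counts = systematic_counts(weights, rng)
print(counts)
```

Each count differs from its expectation $n\omega^{(i)}$ by less than one, whatever the value of the uniform draw, which is the sense in which the extra Monte Carlo variation is reduced.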

Many topical modifications of the Gordon et al. (1993) scheme can be
found in the literature, and we refer the reader to Doucet et al. (2001) for a
detailed coverage of the differences. Let us point out at this stage, however, the
smooth bootstrap technique of Hürzeler and Künsch (1998) and Stavropoulos
and Titterington (2001), where the empirical distribution (14.8) is replaced
with a (nonparametric) kernel approximation which smoothes the proposal
into a continuous distribution, as we will encounter a similar proposal in
Section 14.4.

14.3.6 Fighting the Degeneracy 

When considering an entire path Zt simulated by the sole addition of only 
the current state zt at time t, as in Algorithm [A. 59], it is not difficult to 
realize that as t becomes large, the algorithm must degenerate. As in repeated 
applications of the bootstrap (Section 1.6.2), iterated multinomial sampling 
can deplete the sample all the way down to a single value. That is, since the 
first values of the path are never re-simulated, the multinomial selection step 
at each iteration reduces the number of different values for t < at a rate 
that is faster than exponential (Chopin 2002). Therefore, if we are interested 
in a long path, Zt, the basic particle filter algorithm simply does not work 
well. Note, however, that this is rarely the focus. Instead, the main interest in 
sequential Monte Carlo is to produce good approximations to the distribution 
of the process Zt at time t. 

Example 14.6. (Continuation of Example 14.2) Figure 14.6 represents 
the range of 100 paths simulated over T = 50 observations of the tracking 
model (14.2), that is, at the end of the observation sequence, after reweighting 
has taken place at each observational instant. While the range of these paths 
is never very wide, it is much thinner at the beginning, a result of the repeated 
resampling at each stage of the algorithm. || 




Fig. 14.6. Range of the particle filter path at the end of the observation sequence,
along with the true (simulated) sequence of 50 target moves, for $\tau = 0.05$, $\eta = .01$
and 1000 particles.

If one wants to preserve the entire sample path with proper accuracy, the
price to pay is, at each step, to increase the number of particles in
compensation for the thinning (or depletion) of the sample. In general, this is not a
small price, unfortunately, as the size must increase exponentially, as detailed
in Section 14.3.7! In a specific setting, namely state space models,⁶ Doucet
et al. (2000) establish an optimality result on the best proposal distribution
in terms of the variance of the importance weights, conditional on the past
iterations. But this optimum is not necessarily achievable, as with the regular
importance optimal solution of Theorem 3.12, and can at best slow down the
depletion rate.

Note that degeneracy can occur in a case that is a priori more favorable,
that is, when the $\pi_t$'s all have the same support and where particles
associated with the previous target $\pi_{t-1}$ are resampled with weights proportional to
$\pi_t / \pi_{t-1}$. For instance, if the targets move too quickly in terms of shape or support, very
few particles will survive from one iteration to the next, leading to a fast
depletion of the sample.

Gilks and Berzuini (2001) have devised another solution to this problem,
based on the introduction of additional MCMC moves. Their method is made
of two parts, augmentation if the parameter space is different for $\pi_t$ and $\pi_{t-1}$,
and evolution, which is the MCMC step. The corresponding algorithm reads
as follows.

Algorithm A.60 -Resample-Move Particle Filter-

At iteration $t$,

1. Augmentation: For $i = 1, \ldots, n$, augment $\mathbf{z}_{t-1}^{(i)}$ with its missing
part $\tilde z_t^{(i)}$, simulated from an importance distribution, and set
$\tilde{\mathbf{z}}_t^{(i)} = (\mathbf{z}_{t-1}^{(i)}, \tilde z_t^{(i)})$.

2. Weighting: Compute the corresponding importance weights for the
target distribution $\pi_t$, $\omega_t^{(1)}, \ldots, \omega_t^{(n)}$, as the ratio of $\pi_t(\tilde{\mathbf{z}}_t^{(i)})$ to the
density from which $\tilde{\mathbf{z}}_t^{(i)}$ was generated.

3. Evolution: Generate a new iid sample $\mathbf{z}_t^{(i)}$, $i = 1, \ldots, n$, using a
proposal proportional to

$$\sum_{i=1}^{n} \omega_t^{(i)}\, q_t(\mathbf{z} | \tilde{\mathbf{z}}_t^{(i)})\,,$$

where $q_t$ is a Markov transition kernel with stationary distribution $\pi_t$.

The basis of Algorithm [A.60] is thus to pick up a point $\tilde{\mathbf{z}}_t^{(i)}$ according
to its weight $\omega_t^{(i)}$, that is, via multinomial sampling, and then to implement
one iteration of a Markov transition that is stable with respect to $\pi_t$. Since
the regular multinomial importance sampling produces a sample that is
(empirically) distributed from the target $\pi_t$, the distribution of the final sample is
indeed in (empirical) agreement with $\pi_t$ and is thus a valid importance
sampling methodology. In other words, the importance sampling weight does not
need to be adapted after the MCMC step, as already seen in Lemma 14.1.
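
The resample-then-move idea can be sketched in a fixed-dimension setting, leaving the augmentation step aside. The following is an illustration under assumed choices that are not in the text: the target is $N(2, 1)$, the initial particles come from $N(0, 3^2)$, and the evolution step is a single random-walk Metropolis move with the target as stationary distribution. All names are hypothetical.

```python
import math
import random

def resample_move(particles, weights, log_target, rng, scale=1.0):
    """Multinomial resampling followed by one Metropolis step per selected point."""
    selected = rng.choices(particles, weights=weights, k=len(particles))
    moved = []
    for z in selected:
        prop = z + rng.gauss(0.0, scale)
        # Symmetric random-walk proposal, so the log acceptance ratio is
        # log pi_t(prop) - log pi_t(z).
        if math.log(1.0 - rng.random()) < log_target(prop) - log_target(z):
            moved.append(prop)
        else:
            moved.append(z)
    return moved

def log_target(z):
    """Log density of the assumed target N(2, 1), up to an additive constant."""
    return -0.5 * (z - 2.0) ** 2

rng = random.Random(2)
n = 20_000
particles = [rng.gauss(0.0, 3.0) for _ in range(n)]
# Importance weights pi_t/g up to a constant, with g = N(0, 3^2).
weights = [math.exp(log_target(z) + z * z / 18.0) for z in particles]
sample = resample_move(particles, weights, log_target, rng)
sample_mean = sum(sample) / n
print(sample_mean)  # should be close to the target mean, 2
```

After the move, the particles carry equal weights, in agreement with the remark that the importance weight does not need to be adapted after the MCMC step. The move also breaks some of the ties created by resampling, since accepted proposals are distinct values.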

As above, reduced variance resampling strategies can also be implemented
at the resampling stage. Note that the proposal kernel $q_t$ need not be
known numerically, as long as simulations from $q_t$ are possible. Gilks and
Berzuini (2001) also extend this algorithm to a model choice setting with
sequentially observed data, as in Section 11.2, and they use a reversible jump
algorithm as part of the evolution step 3. Obviously, the success of this
additional step in fighting degeneracy depends on the setup and several iterations
of the MCMC step could be necessary to reach regions of acceptable values
of $\pi_t$.

As noted in Clapp and Godsill (2001), the Augmentation step 1 in
Algorithm [A.60] can itself be supplemented by an importance sampling move
where the entire vector may be modified through an importance
distribution $g$ (and a corresponding modification of the weight). These authors also
suggested tempering schemes to smooth the bridge between $g$ and the target
distribution $\pi_t$. One example is to use a sequence of geometric averages
proportional to

$$g^{1-\alpha_m}(\mathbf{z})\, \pi_t^{\alpha_m}(\mathbf{z})\,,$$

where $0 < \alpha_m < 1$ increases from 0 to 1 (see Note 13.6.2 for an introduction
to tempering).

14.3.7 Convergence of Particle Systems 

So far, the sequential Monte Carlo methods have been discussed from a rather
practical point of view, in that the convergence properties of the various
estimators or approximations have been derived from those of regular
importance sampling techniques, that is, mostly from the Law of Large Numbers.
We must, however, take into account an additional dimension, when compared
with importance sampling (as in Chapter 3), due to the iterative nature of the
method. Earlier results, as those of Gilks and Berzuini (2001), assume that
the sample size goes to infinity at each iteration, but this is unrealistic and
contradicts the speed requirement which is fundamental to these methods.
More recent results, as those of Künsch (2004), give a better understanding
of how these algorithms converge and make explicit the intuition that “there
ain’t no such thing as a free lunch!”, namely, that it does not seem possible
to consider a sequence of target distributions without incurring an increase in
the computational expense.

Although the proofs are too advanced for the level of this book, let us
point out here that Künsch (2004) shows, for state-space models, that the
number of particles in the sample, $N_t$, has to increase exponentially fast
with $t$ to achieve convergence in total variation for the sequential
importance sampling/particle filter method. (This result is not surprising in
light of the intuitive development of Section 14.3.1.) Chopin (2004) also
establishes Central Limit Theorems for both the multinomial and the residual
sampling schemes, with the side result that the asymptotic variance is smaller
for the residual sampling approach, again a rather comforting result. Note
that for the basic sequential sampling in a fixed dimension setting where

$$\omega_t^{(i)} \propto \omega_{t-1}^{(i)}\,\frac{\pi_t(z_i)}{\pi_{t-1}(z_i)},$$

a Central Limit Theorem also holds:

$$\sqrt{n}\,\left\{ \sum_{i=1}^n \omega_t^{(i)} h(z_i) - E_{\pi_t}[h(Z)] \right\}
  \;\stackrel{\mathcal{L}}{\longrightarrow}\; \mathcal{N}\left(0, V_t^0(h)\right),$$

where

$$V_t^0(h) = E_{\pi_0}\left[ \frac{\pi_t(Z)^2}{\pi_0(Z)^2}\,
  \left\{ h(Z) - E_{\pi_t}[h(Z)] \right\}^2 \right],$$

with $\pi_0$ the distribution of the $z_i$'s, by an application of the usual
Central Limit Theorem, since the $z_i$'s are independent. (Establishing the
Central Limit Theorems for the two other sampling schemes of Section 14.3.5 is
beyond our scope.)

In comparing the above three sampling schemes, Chopin (2004) obtained
the important result that, for a fixed state space of dimension $p$, and
under some regularity conditions for the Central Limit Theorem to hold, the
three corresponding asymptotic variances satisfy

$$V_t^0(h) = O(t^{p/2-1}), \qquad V_t^m(h) = O(t^{p/2}) \qquad\text{and}\qquad
  V_t^r(h) = O(t^{p/2}),$$

where $V_t^m(h)$ and $V_t^r(h)$ denote the asymptotic variances for the
estimators of $E_{\pi_t}[h(X)]$ based on the multinomial and the residual
resampling schemes, respectively. Therefore, resampling is more explosive in
$t$ than the mere reweighting scheme. This is due to the fact that resampling
creates less depletion (thus more variation), while always sampling from the
same proposal. In all cases, to fight degeneracy, that is, explosive variance,
the number of particles must increase as a power of $t$. Crisan et al. (1999),
Del Moral and Miclo (2000) and Del Moral (2001) provide alternative (if
mathematically more sophisticated) entries to the study of the convergence of
particle filters.
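
The comparison between multinomial and residual resampling can be explored numerically. The following is a minimal Python sketch, not taken from the text: the function names (`multinomial_counts`, `residual_counts`, `empirical_moments`) and the example weights are illustrative assumptions. It counts how many times each particle is replicated under both schemes, so that the expected count $n\omega_i$ and the reduced variability of the residual scheme can be checked empirically.

```python
import math
import random


def multinomial_counts(weights, n, rng):
    """Replication counts when n points are drawn with replacement."""
    counts = [0] * len(weights)
    for idx in rng.choices(range(len(weights)), weights=weights, k=n):
        counts[idx] += 1
    return counts


def residual_counts(weights, n, rng):
    """Residual sampling: keep floor(n * w_i) copies, draw the rest at random."""
    counts = [math.floor(n * w) for w in weights]
    n_rest = n - sum(counts)
    if n_rest > 0:
        # Remaining points are drawn multinomially from the residual weights.
        residuals = [n * w - c for w, c in zip(weights, counts)]
        for idx in rng.choices(range(len(weights)), weights=residuals, k=n_rest):
            counts[idx] += 1
    return counts


def empirical_moments(sampler, weights, n, index, reps, rng):
    """Empirical mean and variance of one replication count over reps runs."""
    draws = [sampler(weights, n, rng)[index] for _ in range(reps)]
    mean = sum(draws) / reps
    var = sum((d - mean) ** 2 for d in draws) / reps
    return mean, var
```

For instance, with weights `[0.5, 0.25, 0.15, 0.1]` and `n = 10`, both schemes should replicate the second particle 2.5 times on average, while the residual count only varies between 2 and 3.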



14.4 Population Monte Carlo 

The population Monte Carlo (PMC) algorithm, introduced in this section, is
simultaneously an iterated importance sampling scheme that produces, at each
iteration, a sample approximately simulated from a target distribution and
an adaptive algorithm that calibrates the proposal distribution to the target
distribution along iterations. Its theoretical roots are thus within importance
sampling and not within MCMC, despite its iterated features, in that the
approximation to the target is valid (that is, unbiased at least to the order
$O(1/n)$) at each iteration and does not require convergence times nor stopping
rules.



14.4.1 Sample Simulation 

In the previous chapters about MCMC, the stationary distribution has always
been considered to be the limiting distribution of a Markov sequence $(Z_t)$,
with the practical consequence that $Z_t$ is approximately distributed from
$\pi$ for $t$ large enough. A rather straightforward extension of this
perspective is to go from simulating a point distributed from $\pi$ to
simulating a sample of size $n$ distributed from $\pi$ or, rather, from

$$\pi^{\otimes n}(z_1,\ldots,z_n) = \prod_{i=1}^n \pi(z_i).$$

Implementations of this possible extension are found in Warnes (2001) and
Mengersen and Robert (2003), with improvements over a naive programming
of $n$ parallel MCMC runs. Indeed, the entire sample at iteration $t$ can be
used to design a proposal at iteration $t+1$. In Warnes (2001), for instance,
a kernel estimation of the target distribution based on the sample
$(z_1^{(t-1)},\ldots,z_n^{(t-1)})$ is the proposal distribution. The difficulty
with such a proposal is that multidimensional kernel estimators are
notoriously poor. In Mengersen and Robert (2003), each point of the sample is
moved using a random walk proposal that tries to avoid the other points of the
sample by delayed rejection (Tierney and Mira 1998). Note that, for simulation
purposes, a kernel estimator is not different from a random walk proposal. In
both cases, it is more efficient to move each point of the sample separately,
as the average acceptance probability of the entire sample decreases with the
sample size, no matter what the proposal distribution is, using the same
Kullback-Leibler argument as in Section 14.3.3. However, as we will see next,
the recourse to the theory of Markov chains to justify the convergence to
$\pi^{\otimes n}$ is not necessary to obtain a valid approximation of an iid
sample from $\pi$.



14.4.2 General Iterative Importance Sampling 

The PMC algorithm can be described in a very general framework: it is indeed
possible to consider different proposal distributions at each iteration and for
each particle with this algorithm. That is, the $z_i^{(t)}$'s can be simulated
from distributions $q_{it}$ that may depend on past samples,

$$z_i^{(t)} \sim q_{it}(z),$$

independently of one another (conditional on the past samples). Thus, each
simulated point $z_i^{(t)}$ is allocated an importance weight

$$\varrho_i^{(t)} = \frac{\pi(z_i^{(t)})}{q_{it}(z_i^{(t)})}, \qquad i=1,\ldots,n,$$

and approximations of the form

$$\mathfrak{I}_t = \frac{1}{n} \sum_{i=1}^n \varrho_i^{(t)}\, h(z_i^{(t)})$$

are then unbiased estimators of $E^\pi[h(Z)]$, even when the importance
distribution $q_{it}$ depends on the entire past of the experiment. Indeed, we
have

$$E\left[\varrho_i^{(t)} h(Z_i^{(t)})\right]
  = \int\!\!\int \frac{\pi(x)}{q_{it}(x)}\, h(x)\, q_{it}(x)\, dx\; g(\zeta)\, d\zeta
  = \int\!\!\int h(x)\, \pi(x)\, dx\; g(\zeta)\, d\zeta = E^\pi[h(X)],
  \tag{14.9}$$

where $\zeta$ denotes the vector of past random variates that contribute to
$q_{it}$, and $g(\zeta)$ its arbitrary distribution. Furthermore, assuming that
the variances

$$\mathrm{var}\left(\varrho_i^{(t)} h(Z_i^{(t)})\right)$$

exist for every $1 \le i \le n$, which means that the proposals $q_{it}$ should
have heavier tails than $\pi$, we have

$$\mathrm{var}(\mathfrak{I}_t) = \frac{1}{n^2} \sum_{i=1}^n
  \mathrm{var}\left(\varrho_i^{(t)} h(Z_i^{(t)})\right),
  \tag{14.10}$$

due to the canceling effect of the weights (Problem 14.15). In fact, even if
the $Z_i^{(t)}$'s are correlated, the importance-weighted terms will always be
uncorrelated (Lemma 12.11). So, for importance sampling estimators, the
variance of the sum will equal the sum of the variances of the individual
terms.

Note that resampling may take place at some or even all iterations of the
algorithm but that, contrary to the particle systems, there is no propagation
of the weights across iterations.

As in most settings the distribution of interest $\pi$ is unscaled, we instead
use

$$\varrho_i^{(t)} \propto \frac{\pi(z_i^{(t)})}{q_{it}(z_i^{(t)})},
  \qquad i=1,\ldots,n,$$

scaled so that the weights sum up to 1. In this case, the above unbiasedness
property and the variance decomposition are both lost, although they still
approximately hold. In fact, the estimation of the normalizing constant of
$\pi$ improves with each iteration $t$, since the overall average

$$\varpi_t = \frac{1}{tn} \sum_{\tau=1}^{t} \sum_{i=1}^{n}
  \frac{\pi(z_i^{(\tau)})}{q_{i\tau}(z_i^{(\tau)})}
  \tag{14.11}$$

is a convergent estimator of the inverse of the normalizing constant.
Therefore, as $t$ increases, $\varpi_t$ contributes less and less to the bias
and variability of $\mathfrak{I}_t$ and the above properties can be considered
as holding for $t$ large enough. In addition, if $\varpi_{t-1}$ in (14.11) is
used instead of $\varpi_t$, that is, if

$$\varrho_i^{(t)} = \frac{\pi(z_i^{(t)})}{\varpi_{t-1}\, q_{it}(z_i^{(t)})},
  \tag{14.12}$$

the variance decomposition (14.10) can be approximately recovered, via the
same conditioning argument (Problem 14.16).
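
The role of the overall average (14.11) can be illustrated by a small simulation. The sketch below is only an illustration under stated assumptions: the target is an unscaled $\mathcal{N}(0,1)$ density, the proposals are centered normals whose scale changes with the iteration through an arbitrary schedule, and the names `unscaled_target`, `normal_pdf` and `running_constant` are hypothetical. It accumulates $\varpi_t$ over iterations and uses $\varpi_{t-1}$ to scale the weights at iteration $t$, in the spirit of (14.12).

```python
import math
import random


def unscaled_target(x):
    """Unscaled N(0, 1) density; its integral is sqrt(2 * pi)."""
    return math.exp(-0.5 * x * x)


def normal_pdf(x, sd):
    """Density of the centered normal proposal with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def running_constant(n_iter, n, rng, h=lambda x: x * x):
    """Return the final overall average and the per-iteration estimates of E[h]."""
    total, count = 0.0, 0
    varpi = None              # plays the role of varpi_{t-1} inside the loop
    estimates = []
    for t in range(1, n_iter + 1):
        sd = 1.5 + (t % 3)    # hypothetical iteration-dependent proposal scale
        zs = [rng.gauss(0.0, sd) for _ in range(n)]
        raw = [unscaled_target(z) / normal_pdf(z, sd) for z in zs]
        if varpi is not None:
            # Weights scaled by the previous overall average, as in (14.12).
            estimates.append(sum(w * h(z) for w, z in zip(raw, zs)) / (n * varpi))
        total += sum(raw)
        count += n
        varpi = total / count  # overall average (14.11) after iteration t
    return varpi, estimates
```

With this choice of target, the overall average should settle near $\sqrt{2\pi}$ and the scaled estimates of $E[X^2]$ near 1; all proposal scales are larger than the target scale so that the weights keep a finite variance.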



14.4.3 Population Monte Carlo 

Following Iba (2000), Cappé et al. (2004) called their iterative approach
population Monte Carlo, to stress the idea of repeatedly simulating an entire
sample rather than iteratively simulating the points of an approximate sample.

Since the above section establishes that an iterated importance sampling
scheme based on sample dependent proposals is fundamentally a specific kind
of importance sampling, we can propose the following algorithm, which is
validated by the same principles as regular importance sampling.

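Algorithm A.61 -Population Monte Carlo-
For $t = 1, \ldots, T$

1. For $i = 1, \ldots, n$,

   (i) Select the generating distribution $q_{it}(\cdot)$.
   (ii) Generate $\tilde z_i^{(t)} \sim q_{it}(z)$.
   (iii) Compute $\varrho_i^{(t)} = \pi(\tilde z_i^{(t)}) / q_{it}(\tilde z_i^{(t)})$.

2. Normalize the $\varrho_i^{(t)}$'s to sum to 1.

3. Resample $n$ values from the $\tilde z_i^{(t)}$'s with replacement, using
   the weights $\varrho_i^{(t)}$, to create the sample
   $(z_1^{(t)}, \ldots, z_n^{(t)})$.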
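
The three steps of Algorithm [A.61] can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the target (an unscaled $\mathcal{N}(3,1)$ density), the random walk proposals with randomly selected scales, and the names `log_target`, `normal_logpdf` and `pmc` are all assumptions made for the example. The scales are deliberately kept at least as large as the target scale so that the importance weights have finite variance.

```python
import math
import random


def log_target(x):
    """Unscaled log-density of the illustrative target N(3, 1)."""
    return -0.5 * (x - 3.0) ** 2


def normal_logpdf(x, mean, sd):
    """Log-density of the N(mean, sd^2) proposal."""
    return (-0.5 * ((x - mean) / sd) ** 2
            - math.log(sd) - 0.5 * math.log(2.0 * math.pi))


def pmc(n, n_iter, scales, rng):
    """Run n_iter PMC iterations; return the last sample and weighted means."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]   # arbitrary starting sample
    estimates = []
    for _ in range(n_iter):
        proposed, log_w = [], []
        for z_prev in sample:
            sd = rng.choice(scales)                    # step 1(i): pick q_it
            z_new = rng.gauss(z_prev, sd)              # step 1(ii): simulate
            proposed.append(z_new)
            log_w.append(log_target(z_new)             # step 1(iii): weight
                         - normal_logpdf(z_new, z_prev, sd))
        top = max(log_w)
        w = [math.exp(lw - top) for lw in log_w]
        total = sum(w)
        w = [wi / total for wi in w]                   # step 2: normalize
        estimates.append(sum(wi * zi for wi, zi in zip(w, proposed)))
        sample = rng.choices(proposed, weights=w, k=n) # step 3: resample
    return sample, estimates
```

A call such as `pmc(2000, 5, (1.0, 2.0, 4.0), random.Random(1))` returns a resampled population and, for each iteration, the weighted estimate of the target mean, which should be close to 3 here.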

Step 1.(i) is singled out because it is an essential feature of the PMC
algorithm: as demonstrated in the previous section, the proposal distributions
can be individually tailored at each step of the algorithm without
jeopardizing the validity of the method. The proposals $q_{it}$ can therefore
be picked according to the performances of the previous $q_{i(t-1)}$'s
(possibly in terms of survival of the values generated with a particular
proposal, or a variance criterion, or even on all the previously simulated
samples, if storage allows). For instance, the $q_{it}$'s can include, with
low probability, large tails proposals as in the defensive sampling strategy
of Hesterberg (1998) (Section 3.3.2), to ensure finite variance.
Note the formal similarity with Warnes’s (2001) use of the previous sample to
build a nonparametric kernel approximation to $\pi$. The main difference is
that the proposal does not aim at a good approximation of $\pi$ using standard
nonparametric results like bandwidth selection, but may remain multiscaled
over the iterations, as illustrated in Section 14.4.4. The main feature of the
PMC algorithm is indeed that several scenarios can be tested in parallel and
tuned along iterations, a feature that can hardly be achieved within the
domain of MCMC algorithms (Section 7.6.3).

There also are similarities between the PMC algorithm and earlier proposals
in the particle system literature, in particular with Algorithm [A.60], since
the latter also considers iterated samples with (SIR) resampling steps based on
importance weights. A major difference, though (besides the dynamic setting
of moving target distributions), is that [A.60] remains an MCMC algorithm
and thus needs to use Markov transition kernels with a given stationary
distribution. There is also a connection with Chopin (2002), who considers
iterated importance sampling with changing proposals. His setting is a special
case of the PMC algorithm in a Bayesian framework, where the proposals
$q_{it}$ are the posterior distributions associated with a portion $k_t$ of
the observed dataset (and are thus independent of $i$ and of the previous
samples). As detailed in the following sections, the range of possible choices
for the $q_{it}$'s is actually much wider.



14.4.4 An Illustration for the Mixture Model 

Consider the normal mixture model of Example 5.19, that is,
$p\,\mathcal{N}(\mu_1, 1) + (1-p)\,\mathcal{N}(\mu_2, 1)$, where $p \ne 1/2$ is
known, and the corresponding simulation from $\pi(\mu_1, \mu_2 | x)$, the
posterior distribution for an iid sample $x = (x_1, \ldots, x_n)$ and an
arbitrary proper prior on $(\mu_1, \mu_2)$. While we presented in Chapter 9 a
Gibbs sampler based on a data augmentation step via the indicator variables,
Celeux et al. (2003) show that a PMC sampler can be efficiently implemented
without this augmentation step.

Given the posterior distribution

$$\pi(\mu_1, \mu_2 | x) \propto \exp\{-\lambda(\theta - \mu_1)^2/2\sigma^2\}\,
  \exp\{-\lambda(\theta - \mu_2)^2/2\sigma^2\}
  \prod_{i=1}^n \left\{ p \exp\{-(x_i - \mu_1)^2/2\sigma^2\}
  + (1-p) \exp\{-(x_i - \mu_2)^2/2\sigma^2\} \right\},$$

a natural possibility is to choose a random walk for the proposal distribution
(see Section 7.5). That is, starting from a sample of values of
$\mu = (\mu_1, \mu_2)$, generate random isotropic perturbations of the points
of this sample.

The difficult issue of selecting the scale of the random walk (see Section
7.6), found in MCMC settings, can be bypassed by virtue of the adaptivity of
the PMC algorithm. Indeed, if we take as proposals $q_{it}$ normal
distributions centered at the points of the current sample,
$\mathcal{N}_2(\mu_i^{(t)}, \sigma_{it} I_2)$, the variance factors
$\sigma_{it}$ can be chosen at random from a set of $K$ scales $v_k$
($1 \le k \le K$) ranging from, e.g., a large power of 10 down to a small one,
if this range is compatible with the range of the observations. At each
iteration $t$ of the PMC algorithm, the probability of choosing a particular
scale $v_k$ can be calibrated according to the performance of the different
scales over the previous iterations. For instance, a possible criterion is to
select a scale with probability proportional to its non-degeneracy rate on the
previous iterations, that is, the percentage of points associated with $v_k$
that survived past the resampling step 3. The reasoning behind this scheme is
that, if most $\mu_i$'s associated with a given scale $v_k$ are not resampled,
the scale is not appropriate and thus should not be much used in the next
iterations. However, when the survival rate is null, in order to avoid a
definitive removal of the corresponding scale, the next probability $\zeta_k$
is set to a positive value $\varepsilon$.

In order to smooth the selection of the scales, Rao-Blackwellization should
also be used in the computation of the importance weights, using as the
denominator

$$\sum_k \zeta_k\, \varphi(\mu_i^{(t)}; \mu_i^{(t-1)}, v_k),$$

where $\varphi(\mu; \xi, v)$ here denotes the density of the two-dimensional
normal distribution with mean $\xi$ and variance $v I_2$ at the vector $\mu$.

The corresponding PMC algorithm thus looks as follows. 

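Algorithm A.62 -Mixture PMC algorithm-
Step 0: Initialization

  For $i = 1, \ldots, n$, generate $\mu_i^{(0)}$ from an arbitrary distribution
  For $k = 1, \ldots, K$, set $v_k$ and $\zeta_k = 1/K$.

Step t: Update

  For $i = 1, \ldots, n$,

    a. with probability $\zeta_k$, take $\sigma_{it} = v_k$

    b. generate $\mu_i^{(t)} \sim \mathcal{N}_2(\mu_i^{(t-1)}, \sigma_{it} I_2)$

    c. compute the weights
       $\varrho_i \propto \pi(\mu_i^{(t)} | x) \big/
        \sum_k \zeta_k\, \varphi(\mu_i^{(t)}; \mu_i^{(t-1)}, v_k)$

  Resample the $\mu_i^{(t)}$'s using the weights $\varrho_i$.

  Update the $\zeta_k$'s as $\zeta_k \propto \varepsilon + r_k$, where $r_k$
  is the number of $\mu_i$'s generated with variance $v_k$ that have been
  resampled in the previous step.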
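
A compact Python sketch of Algorithm [A.62] may help fix ideas. It is written under several assumptions that are not in the text: the function names `log_posterior` and `mixture_pmc`, the initial distribution $\mathcal{N}(1, 2^2)$ on each mean, a $\mathcal{N}(1, 10)$ prior as in the experiment described below, and the convention that $r_k$ counts resampled points with multiplicity. The weights use the Rao-Blackwellized mixture denominator.

```python
import math
import random


def log_posterior(mu, data, p=0.3, prior_mean=1.0, prior_var=10.0):
    """Unscaled log-posterior of (mu1, mu2) for the two-component mixture."""
    m1, m2 = mu
    lp = -((m1 - prior_mean) ** 2 + (m2 - prior_mean) ** 2) / (2.0 * prior_var)
    for x in data:
        lik = (p * math.exp(-0.5 * (x - m1) ** 2)
               + (1.0 - p) * math.exp(-0.5 * (x - m2) ** 2))
        lp += math.log(lik + 1e-300)   # tiny floor guards against log(0)
    return lp


def mixture_pmc(data, n, n_iter, scales, eps, rng):
    """Return the final resampled population and the scale probabilities."""
    K = len(scales)
    zeta = [1.0 / K] * K
    sample = [(rng.gauss(1.0, 2.0), rng.gauss(1.0, 2.0)) for _ in range(n)]
    for _ in range(n_iter):
        proposed, chosen, log_w = [], [], []
        for m1, m2 in sample:
            k = rng.choices(range(K), weights=zeta)[0]       # step a
            sd = math.sqrt(scales[k])                        # v_k is a variance
            new = (rng.gauss(m1, sd), rng.gauss(m2, sd))     # step b
            d2 = (new[0] - m1) ** 2 + (new[1] - m2) ** 2
            # Rao-Blackwellized denominator: mixture over all scales.
            denom = sum(z * math.exp(-0.5 * d2 / v) / (2.0 * math.pi * v)
                        for z, v in zip(zeta, scales))
            proposed.append(new)
            chosen.append(k)
            log_w.append(log_posterior(new, data) - math.log(denom))  # step c
        top = max(log_w)
        w = [math.exp(lw - top) for lw in log_w]
        idx = rng.choices(range(n), weights=w, k=n)          # resampling
        sample = [proposed[i] for i in idx]
        r = [0] * K
        for i in idx:
            r[chosen[i]] += 1                                # survivors per scale
        total = sum(eps + rk for rk in r)
        zeta = [(eps + rk) / total for rk in r]              # zeta_k ~ eps + r_k
    return sample, zeta
```

The returned `zeta` shows which scales have been favored by the survival criterion; adding `eps` to every count keeps each scale available even when none of its points is resampled.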
The performance of this algorithm is illustrated on a simulated dataset of
500 observations from $0.3\,\mathcal{N}(0, 1) + 0.7\,\mathcal{N}(2, 1)$. As
described by Figure 14.7 through the sequence of simulated samples, the result
of the experiment is that, after 8 iterations of the PMC algorithm, the
simulated $\mu$'s are concentrated around the mode of interest and the scales
$v_k$ (equal to .01, .05, .1 and .5) have been selected according to their
relevance, that is, with large weights for the smallest values (as also
described in Figure 14.8). While the second spurious mode is visited during
the first iteration of the algorithm, the relatively small value of the
posterior at this mode implies that the corresponding points are not resampled
at the next iteration.



Fig. 14.7. Representation of the log-posterior distribution via contours and of a
sequence of PMC samples over the first 8 iterations. The sample of 500 observations
was generated from $0.3\,\mathcal{N}(0, 1) + 0.7\,\mathcal{N}(2, 1)$ and the prior
on $\mu$ was a $\mathcal{N}(1, 10)$ distribution on both means.



14.4.5 Adaptivity in Sequential Algorithms

A central feature of the PMC method is that the generality in the choice
of the proposal distributions $q_{it}$ is due to the abandonment of the MCMC
framework. Indeed, were it not for the importance resampling correction, a
pointwise Metropolis-Hastings algorithm would produce a parallel MCMC
sampler which simply converges to the target $\pi^{\otimes n}$ in distribution.
Similarly, a samplewise Metropolis-Hastings algorithm, that is, a
Metropolis-Hastings algorithm aimed at the target distribution
$\pi^{\otimes n}$, also produces an asymptotic approximation to this
distribution, but its acceptance probability approximately decreases as a
power of $n$. This difference is not simply a theoretical advantage since, in
one example of Cappé et al. (2004), it actually occurs that a
Metropolis-Hastings scheme based on the same proposal does not work well
while a PMC algorithm produces correct answers.

Fig. 14.8. Evolution of the cumulative weights of the four scales
$v_k = .5, .1, .05, .01$ for the mixture PMC algorithm over the first 8
iterations corresponding to Figure 14.7.

Example 14.7. (Continuation of Example 14.3) For the stochastic
volatility model (14.4), Celeux et al. (2003) consider a noninformative prior
$\pi(\beta^2, \varphi, \sigma^2) = 1/(\sigma\beta)$ under the stationarity
constraint $|\varphi| < 1$. Posteriors on $\beta^2$ and $\sigma^2$ are both
conjugate, conditional on the $z_t$'s, while the posterior distribution of
$\varphi$ is less conventional, but a standard proposal (Chib et al. 2002) is
a truncated normal distribution on $]-1, 1[$ with mean and variance

$$\sum_{t=2}^n z_t z_{t-1} \Big/ \sum_{t=2}^n z_{t-1}^2
  \qquad\text{and}\qquad \sigma^2 \Big/ \sum_{t=2}^n z_{t-1}^2 .$$

There have been many proposals in the literature for simulating the $z_t$'s
(see Celeux et al. 2003). For instance, one based on a Taylor expansion of the
exponential is a normal distribution with mean

$$\frac{(1+\varphi^2)\mu_t/\sigma^2 + 0.5 \exp(-\mu_t)\, y_t^2 (1+\mu_t)/\beta^2 - 0.5}
       {(1+\varphi^2)/\sigma^2 + 0.5 \exp(-\mu_t)\, y_t^2/\beta^2}$$

and variance

$$1 \Big/ \left\{ (1+\varphi^2)/\sigma^2 + 0.5 \exp(-\mu_t)\, y_t^2/\beta^2 \right\},$$

where $\mu_t = \varphi(z_{t-1} + z_{t+1})/(1+\varphi^2)$ is the conditional
expectation of $z_t$ given $z_{t-1}, z_{t+1}$.
Celeux et al. (2003) use a simulated dataset of size $n = 1000$ with
$\beta^2 = 1$, $\varphi = 0.99$ and $\sigma^2 = 0.01$, in order to compare the
performance of the PMC algorithm with an MCMC algorithm based on exactly the
same proposals.

First, the results of the MCMC algorithm, based on 10,000 iterations,
are presented in Figures 14.9 and 14.10. The estimate of $\beta^2$ (over the
last 5000 simulated values) is 0.98, while the estimate of $\varphi$ is equal
to 0.89 and the estimate of $\sigma^2$ is equal to 0.099. While the
reconstituted volatilities are on average close to the true values, the
parameter estimates are rather poor, even though the cumulative averages of
Figure 14.9 do not exhibit any difficulty with convergence. Note, however, the
slow mixing on $\beta$ in Figure 14.9 (upper left) and, to a lesser degree, on
$\sigma^2$ (middle left).

Fig. 14.9. Evolution of the MCMC samples for the three parameters (left) and
convergence of the MCMC estimators (right). (Source: Celeux et al. 2003.)



Then, with the same proposal distributions, Celeux et al. (2003) have
iterated a PMC algorithm ten times with $M = 1000$. The results are presented
in Figures 14.11 and 14.12. The estimate of $\varphi$ (over the 10 iterations)
is equal to 0.87, while the estimate of $\sigma^2$ is equal to 0.012 and the
estimate of $\beta^2$ is equal to 1.04. These estimations are clearly closer
to the true values than the ones obtained with the MCMC algorithm. (Note that
the scales on Figure 14.11 (left) are much smaller than those of Figure 14.9
(right).) Moreover, Figure 14.12 provides an excellent reconstitution of the
volatilities. ||

Fig. 14.12. Comparison of the true volatility (black) with the PMC estimation
based on the 10th iteration weighted PMC sample (grey). (Source: Celeux et al.
2003.)



As shown in Cappé et al. (2004) (see also West 1992 and Guillin et al.
2004), the PMC framework allows, in addition, for a construction of adaptive
schemes, i.e., of proposals that correct themselves against past performances,
that is much easier than in MCMC setups, as described in Section 7.6.3.
Indeed, from a theoretical point of view, ergodicity is not an issue for PMC
methods since the validity is obtained via importance sampling justifications
and, from a practical point of view, the total freedom allowed by unrestricted
parallel simulations is a major asset.

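An extension of the PMC method can be found in Del Moral and Doucet
(2003): their generalization is to consider two Markov kernels, $K_+$ and
$K_-$, such that, given a particle at time $t-1$, $x_i^{(t-1)}$, a new particle
$\tilde x_i^{(t)}$ is generated from $K_+(x_i^{(t-1)}, \tilde x)$ and
associated with a weight

$$\omega_i^{(t)} \propto
  \frac{\pi(\tilde x_i^{(t)})\, K_-(\tilde x_i^{(t)}, x_i^{(t-1)})}
       {\pi(x_i^{(t-1)})\, K_+(x_i^{(t-1)}, \tilde x_i^{(t)})} .$$

As in Algorithm [A.61], the sample $(x_1^{(t)}, \ldots, x_n^{(t)})$ is then
obtained by multinomial sampling from the $\tilde x_i^{(t)}$'s, using the
weights $\omega_i^{(t)}$. The most intriguing feature of this extension is
that the kernel $K_-$ is irrelevant for the unbiasedness of the new sample.
Indeed,

$$E\left[ h(\tilde X^{(t)})\,
  \frac{\pi(\tilde X^{(t)})\, K_-(\tilde X^{(t)}, X^{(t-1)})}
       {\pi(X^{(t-1)})\, K_+(X^{(t-1)}, \tilde X^{(t)})} \right]
  = \int\!\!\int h(\tilde x)\, \pi(\tilde x)\, K_-(\tilde x, x)\, dx\, d\tilde x
  = E^\pi[h(X)],$$

whatever $K_-$ is chosen; the only requirement is that $K_-(\tilde x, x)$
integrates to 1 as a function of $x$ (which is reminiscent of Monte Carlo
marginalization; see Problem 3.21).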

While this scheme does provide a valid PMC algorithm, what remains
to be assessed is whether wealth is a mixed blessing, that is, if the added
value brought by the generality of the choice of the kernels $K_+$ and $K_-$
is substantial. For example, Doucet et al. (2000) show that the optimal choice
of $K_+$, in terms of conditional variance of the weights, is

$$K_+^*(x, \tilde x) \propto \pi(\tilde x)\, K_-(\tilde x, x),$$

in which case

$$\omega_i^{(t)} \propto \frac{\int \pi(u)\, K_-(u, x_i^{(t-1)})\, du}{\pi(x_i^{(t-1)})}$$

depends only on the previous value $x_i^{(t-1)}$. This is both a positive
feature, given that only resampled $\tilde x_i^{(t)}$ need to be simulated,
and a negative feature, given that the current sample is chosen in terms of
the previous sample, even though $K_-$ is arbitrary. Note also that $K_-$ must
be chosen so that

$$\int \pi(u)\, K_-(u, x_i^{(t-1)})\, du$$

is computable. Conversely, if $K_-$ is chosen in a symmetric way,

$$K_-^*(\tilde x, x) \propto \pi(x)\, K_+(x, \tilde x),$$

then

$$\omega_i^{(t)} \propto \frac{\pi(\tilde x_i^{(t)})}{\int \pi(u)\, K_+(u, \tilde x_i^{(t)})\, du}$$

depends only on the current value $\tilde x_i^{(t)}$. In the special case
where $K_+$ is a Markov kernel with stationary distribution $\pi$, the
integral can be computed and the importance weight is one, which is natural,
but this also shows that the method is in this case nothing but parallel MCMC
sampling. (See also Del Moral and Doucet 2003 for extensions.)



14.5 Problems 



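14.1 (MacEachern et al. 1999) Consider densities $f$ and $g$ such that

$$f_1(z_1) = \int f(z_1, z_2)\, dz_2 \qquad\text{and}\qquad
  g_1(z_1) = \int g(z_1, z_2)\, dz_2 .$$

(a) Show that

$$\mathrm{var}_g\left( \frac{f(Z_1, Z_2)}{g(Z_1, Z_2)} \right) \ge
  \mathrm{var}_g\left( \frac{f_1(Z_1)}{g_1(Z_1)} \right),$$

where $\mathrm{var}_g$ denotes the variance under the joint density
$g(z_1, z_2)$.

(b) Justify a Rao-Blackwellization argument that favors the integration of
auxiliary variables in importance samplers.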

14.2 Consider Algorithm [A.63] with $\theta = 1$ and $\delta = 0$.
(a) Show that the new weight is

$$\omega' = \begin{cases}
  \varrho + 1 & \text{if } u < \varrho/(\varrho + 1), \\
  \omega(\varrho + 1) & \text{otherwise.}
\end{cases}$$

(b) Show that the expectation of $\omega'$ when integrating in
$U \sim \mathcal{U}([0, 1])$ is $\omega + \varrho$.

(c) Deduce that the sequence of weights is explosive.

14.3 For the Q-move proposal of Liu et al. (2001) (see the discussion after
Algorithm [A.63]) show that the average value of $\omega'$ is
$a\omega(1 - \varrho) + \varrho$ if $\theta = 1$ and deduce a condition on $a$
for the expectation of this average to increase.

14.4 Consider a hidden Markov model with hidden Markov chain $(X_t)$ on
$\{1, \ldots, \kappa\}$, associated with a transition matrix $P$ and observable
$Y_t$ such that

$$Y_t | X_t = x_t \sim f(y_t | x_t).$$

(a) Show the actualization equations are given by

$$p(x_{1:t} | y_{1:t}) =
  \frac{f(y_t | x_t)\, p(x_{1:t} | y_{1:(t-1)})}{p(y_t | y_{1:(t-1)})}$$

and

$$p(x_{1:t} | y_{1:(t-1)}) = P_{x_{t-1} x_t}\, p(x_{1:(t-1)} | y_{1:(t-1)}),$$

where $f(y_t | x_t)$ is the density $f_{\theta_{x_t}}(y_t)$ and $P_{nm}$
denotes the $(n, m)$-th element of $P$.

(b) Deduce from

$$p(x_{1:t} | y_{1:t}) \propto f(y_t | x_t)\, P_{x_{t-1} x_t}\,
  p(x_{1:(t-1)} | y_{1:(t-1)})$$

that computation of the filtering density $p(x_{1:t} | y_{1:t})$ has the same
complexity as the computation of the density $p(y_t | y_{1:(t-1)})$.

(c) Verify the propagation equation

$$p(x_t | y_{1:(t-1)}) = \sum_{x_{1:(t-1)}} p(x_{1:(t-1)} | y_{1:(t-1)})\, P_{x_{t-1} x_t} .$$

14.5 (Continuation of Problem 14.4) In this problem, we establish the
forward-backward equations for hidden Markov chains, also called Baum-Welch
formulas (Baum and Petrie 1966).

(a) Show that, if we denote ($i = 1, \ldots, \kappa$)

$$\gamma_t(i) = P(x_t = i | y_{1:T}), \qquad t \le T,$$

and we set

$$\alpha_t(i) = p(y_{1:t}, x_t = i), \qquad \beta_t(i) = p(y_{t+1:T} | x_t = i),$$

then we have the following recursive equations

$$\begin{cases}
  \alpha_1(i) = f(y_1 | x_1 = i)\, \pi_i \\
  \alpha_{t+1}(j) = f(y_{t+1} | x_{t+1} = j) \sum_{i=1}^{\kappa} \alpha_t(i)\, P_{ij}
\end{cases}$$

and

$$\begin{cases}
  \beta_T(i) = 1 \\
  \beta_t(i) = \sum_{j=1}^{\kappa} P_{ij}\, f(y_{t+1} | x_{t+1} = j)\, \beta_{t+1}(j)
\end{cases}$$

and

$$\gamma_t(i) = \alpha_t(i)\, \beta_t(i) \Big/ \sum_{j=1}^{\kappa} \alpha_t(j)\, \beta_t(j) .$$

(b) Deduce that the computation of $\gamma_t$ is achieved in $O(T\kappa^2)$
operations.

(c) If we denote

$$\xi_t(i, j) = P(x_t = i, x_{t+1} = j | y_{1:T}), \qquad i, j = 1, \ldots, \kappa,$$

show that

$$\xi_t(i, j) = \frac{\alpha_t(i)\, P_{ij}\, f(y_{t+1} | x_{t+1} = j)\, \beta_{t+1}(j)}
  {\sum_{i=1}^{\kappa} \sum_{j=1}^{\kappa} \alpha_t(i)\, P_{ij}\,
   f(y_{t+1} | x_{t+1} = j)\, \beta_{t+1}(j)},
  \tag{14.13}$$

which can still be obtained in $O(T\kappa^2)$ operations.

(d) When computing the forward and backward variables, $\alpha_t$ and
$\beta_t$, there often occur numerical problems of overflow or underflow. Show
that, if the $\alpha_t(i)$'s are renormalized by
$c_t = \sum_{i=1}^{\kappa} \alpha_t(i)$ at each stage $t$ of the recursions in
part (a), this renormalization, while modifying the backward variables, does
not change the validity of the $\gamma_t$ relations.

14.6 (Continuation of Problem 14.5) In this problem, we introduce backward
smoothing and the Viterbi algorithm for the prediction of the hidden Markov
chain.

(a) Show that $p(x_s | x_{s-1}, y_{1:t}) = p(x_s | x_{s-1}, y_{s:t})$.

(b) Establish the global backward equations ($s = t, t-1, \ldots, 1$)

$$p(x_s | x_{s-1}, y_{1:t}) \propto P_{x_{s-1} x_s}\, f(y_s | x_s)
  \sum_{x_{s+1}} p(x_{s+1} | x_s, y_{1:t}),$$

with $p(x_t | x_{t-1}, y_{1:t}) \propto P_{x_{t-1} x_t}\, f(y_t | x_t)$, and
deduce that

$$p(x_1 | y_{1:t}) \propto \pi(x_1)\, f(y_1 | x_1) \sum_{x_2} p(x_2 | x_1, y_{1:t}),$$

where $\pi$ is the stationary distribution of $P$, can be computed in
$O(t\kappa^2)$ operations.

(c) Show that the joint distribution $p(x_{1:t} | y_{1:t})$ can be computed
and simulated in $O(t\kappa^2)$ operations.

(d) Show that, if the model parameters are known, the backward equations
in part (a) allow for a sequential maximization of $p(x_{1:t} | y_{1:t})$.
(Note: This algorithm comes from signal processing and is called the Viterbi
algorithm, even though it is simply a special case of dynamic programming.)

(e) Deduce from Bayes formula that

$$p(y_{1:t}) = \frac{p(y_{1:t} | x_{1:t})\, p(x_{1:t})}{p(x_{1:t} | y_{1:t})}$$

and derive from the backward equations of part (a) a representation of the
observed likelihood. (Hint: Since the left hand side of this equation does not
depend on $x_{1:t}$, one can use an arbitrary (but fixed) sequence $x_{1:t}$ on
the right hand side.)

(f) Conclude that the likelihood of a hidden Markov model with $\kappa$ states
and $T$ observations can be computed in $O(T\kappa^2)$ operations.

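14.7 (Continuation of Problem 14.6) (Cappé 2001) In this problem, we show that
both the likelihood and its gradient can be computed for a hidden Markov chain.
We denote by $\theta$ the parameters of the distributions of the $y_t$'s
conditional on the $x_t$'s, and by $\eta$ the parameters of $P$. We also
suppose that the (initial) distribution of $x_1$, denoted $\varphi_1(\cdot)$,
is fixed and known.

(a) We call

$$\varphi_t(i) = p(x_t = i | y_{1:t-1})$$

the prediction filter. Verify the forward equations

$$\varphi_1(j) = p(x_1 = j), \qquad
  \varphi_{t+1}(j) = \frac{1}{c_t} \sum_{i=1}^{\kappa} f(y_t | x_t = i)\,
  \varphi_t(i)\, P_{ij},$$

where

$$c_t = \sum_{k=1}^{\kappa} f(y_t | x_t = k)\, \varphi_t(k),$$

which are based on the same principle as the backward equations.

(b) Deduce that the (log-)likelihood can be written as

$$\log p(y_{1:t}) = \sum_{r=1}^{t} \log\left[ \sum_{i=1}^{\kappa}
    p(y_r, x_r = i | y_{1:(r-1)}) \right]
  = \sum_{r=1}^{t} \log\left[ \sum_{i=1}^{\kappa} f(y_r | x_r = i)\,
    \varphi_r(i) \right].$$

(c) Compare this derivation with the one of question (e) in Problem 14.6, in
terms of computational complexity and storage requirements.

(d) Show that the gradient against $\theta$ of the log-likelihood is

$$\nabla_\theta \log p(y_{1:t}) = \sum_{r=1}^{t} \frac{1}{c_r}
  \sum_{i=1}^{\kappa} \left[ \varphi_r(i)\, \nabla_\theta f(y_r | x_r = i)
  + f(y_r | x_r = i)\, \nabla_\theta \varphi_r(i) \right],$$

where $\nabla_\theta \varphi_t(i)$ can be computed recursively as

$$\nabla_\theta \varphi_{t+1}(j) = \frac{1}{c_t} \sum_{i=1}^{\kappa}
  \left( P_{ij} - \varphi_{t+1}(j) \right)
  \left[ \varphi_t(i)\, \nabla_\theta f(y_t | x_t = i)
  + f(y_t | x_t = i)\, \nabla_\theta \varphi_t(i) \right].$$

(e) Show that the gradient against $\eta$ of the log-likelihood is

$$\nabla_\eta \log p(y_{1:t}) = \sum_{r=1}^{t} \frac{1}{c_r}
  \sum_{i=1}^{\kappa} f(y_r | x_r = i)\, \nabla_\eta \varphi_r(i),$$

where $\nabla_\eta \varphi_t(i)$ can be computed recursively as

$$\nabla_\eta \varphi_{t+1}(j) = \frac{1}{c_t} \sum_{i=1}^{\kappa}
  f(y_t | x_t = i) \left[ \left( P_{ij} - \varphi_{t+1}(j) \right)
  \nabla_\eta \varphi_t(i) + \varphi_t(i)\, \nabla_\eta P_{ij} \right].$$

(f) Show that the complexity of these computations is in $O(t\kappa^2)$.

(g) Show that the storage requirements for the gradient computation are
constant in $t$ and of order $O(p\kappa)$, if $p$ is the dimension of
$(\theta, \eta)$.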

14.8 (Continuation of Problem 14.7) In the case where $P$ is known and
$f(y|x)$ is the density of the normal distribution
$\mathcal{N}(\mu_x, \sigma_x^2)$, explicitly write the gradient equations of
part (d).

14.9 Show that residual sampling, defined in Section 14.3.5, is unbiased, that
is, if $m_i$ denotes the number of replications of $z_i$, then
$E[m_i] = n\omega_i$.

14.10 (Continuation of Problem 14.9) Show that the sampling method of Crisan 
et al. (1999), defined in Section 14.3.5, is unbiased. Write an explicit formula 
for the probability that Nt goes to 0 or to oo as t increases. 

14.11 (Continuation of Problem 14.10) For the sampling method of Carpenter 
et al. (1999) defined in Section 14.3.5, 

(a) Show that it is unbiased. 

(b) Examine whether the variance of the $m_i$'s is smaller than for the two
above methods.

(c) Show that the simulated points are dependent. 

14.12 (Continuation of Problem 14.11) Douc (2004) proposes an alternative to
residual sampling called comb sampling. It starts as residual sampling by
sampling $\lfloor n\omega_i \rfloor$ copies of $z_i$ ($i = 1, \ldots, n$).
Denoting $N_r = n - \sum_i \lfloor n\omega_i \rfloor$ and $F_n$ the empirical
cdf associated with the weighted sample, the remaining $N_r$ points are
sampled by taking one point at random in each interval $(k/N_r, (k+1)/N_r)$
($k = 0, \ldots, N_r - 1$) and inverting this sample by $F_n$.

(a) Show that this scheme is unbiased. 

(b) Show that the simulated points are independent. 

(c) Show that the corresponding variance is smaller than for the residual sam- 
pling. 

14.13 For the stochastic volatility model of Example 14.3, when $\beta = 0$:

(a) Examine the identifiability of the parameters.

(b) Show that the observed likelihood satisfies

$$L(\varphi, \sigma | y_0, \ldots, y_T) =
  E\left[ L^c(\varphi, \sigma | y_0, \ldots, y_T, z_0, \ldots, z_T)
  \,\big|\, y_0, \ldots, y_T \right],$$

where

$$L^c(\varphi, \sigma | y_0, \ldots, y_T, z_0, \ldots, z_T) \propto
  \exp\left\{ -\sum_{t=0}^{T} \left( y_t^2 e^{-z_t} + z_t \right)/2 \right\}\,
  \sigma^{-(T+1)} \exp\left\{ -\left[ z_0^2 + \sum_{t=1}^{T}
  (z_t - \varphi z_{t-1})^2 \right] \Big/ 2\sigma^2 \right\}.$$

(c) Examine whether or not an EM algorithm can be implemented in this case.

(d) For a noninformative prior, $\pi(\varphi, \sigma) = 1/\sigma$, determine
whether or not the posterior distribution is well-defined.

(e) Examine the data augmentation algorithm for drawing Bayesian inference
on this model, and show that the simulation of the volatilities $z_t$ cannot
be done using a standard distribution.

(f) Derive a Metropolis-Hastings modification based on an approximation of
the full conditional

$$f(z_t | z_{t-1}, z_{t+1}, y_t, \varphi, \sigma) \propto
  \exp\left\{ -(z_t - \varphi z_{t-1})^2/2\sigma^2
  - (z_{t+1} - \varphi z_t)^2/2\sigma^2 - z_t/2 - y_t^2 e^{-z_t}/2 \right\}.$$

(Hint: For instance, the expression $z_t/2 + y_t^2 e^{-z_t}/2$ in the
exponential can be approximated by $(z_t - \log(y_t^2))^2/2$, leading to the
Metropolis-Hastings proposal

$$\mathcal{N}\left( \frac{\varphi (z_{t-1} + z_{t+1})\, \sigma^{-2} + \log(y_t^2)/2}
  {(1 + \varphi^2)\, \sigma^{-2} + 1/2},\;
  \frac{1}{(1 + \varphi^2)\, \sigma^{-2} + 1/2} \right).$$

See Shephard and Pitt (1997) for another proposal.)

14.14 In the case of a hidden Markov model defined in Section 14.3.2, Pitt and
Shephard (1999) propose a modification of the bootstrap filter [A.59] called
the auxiliary particle filter.

(a) Show that, if $(z_{t-1}^{(1)}, \ldots, z_{t-1}^{(n)})$ is the sample of
simulated missing variables at time $t-1$, with weights $\omega_{t-1}^{(i)}$,
the distribution

$$\hat\pi(z_t | x_1, \ldots, x_t) \propto f(x_t | z_t)
  \sum_i \omega_{t-1}^{(i)}\, \pi(z_t | z_{t-1}^{(i)})$$

provides an unbiased approximation to the predictive distribution
$\pi(z_t | x_1, \ldots, x_t)$.

(b) Show that sampling from the joint distribution

$$\hat\pi(z_t, \kappa | x_1, \ldots, x_t) \propto f(x_t | z_t)\,
  \omega_{t-1}^{(\kappa)}\, \pi(z_t | z_{t-1}^{(\kappa)}),$$

with $\kappa = 1, \ldots, n$, is equivalent to a simulation from $\hat\pi$ of
part (a).

(c) Deduce that this representation has the appealing feature of reweighting
the particles $z_{t-1}^{(i)}$ by the density of the current observable,
$f(x_t | z_t)$, and that it is thus more efficient than the standard bootstrap
filter.

14.15 Using the same decomposition as in (14.9), and conditioning on a random
variable $\zeta$ that determines both proposal distributions $q_{it}$ and
$q_{jt}$, show that $\varrho_i^{(t)} h(x_i^{(t)})$ and
$\varrho_j^{(t)} h(x_j^{(t)})$ are uncorrelated and deduce (14.10). (The
argument is similar to that of Lemma 12.11.)

14.16 (a) For the PMC algorithm [A.61] show that when the normalizing constant
of $\pi$ is unknown and replaced with the estimator $1/\varpi_{t-1}$, as in
(14.12), the variance decomposition (14.10) approximately holds for large
$t$'s.

(b) From the decomposition (Problem 12.14)

$$\mathrm{var}(\mathfrak{I}_t) = \mathrm{var}_\zeta\left( E[\mathfrak{I}_t | \zeta] \right)
  + E_\zeta\left[ \mathrm{var}(\mathfrak{I}_t | \zeta) \right],$$

show that the term $\mathrm{var}_\zeta(E[\mathfrak{I}_t | \zeta])$ is of order
$O(1/t)$ and is thus negligible.

(c) Show that the conditional variance $\mathrm{var}(\mathfrak{I}_t | \zeta)$
satisfies

$$\mathrm{var}(\mathfrak{I}_t | \zeta) = \frac{1}{n^2} \sum_i
  \mathrm{var}\left( \varrho_i^{(t)} h(Z_i^{(t)}) \,\big|\, \zeta \right).$$

14.17 Consider a hidden Markov Poisson model: the observed $x_t$'s depend on an
unobserved Markov chain $(Z_t)$ such that ($i, j = 1, 2$)

$$X_t|z_t \sim \mathcal{P}(\lambda_{z_t}), \qquad P(Z_t = i|z_{t-1} = j) = p_{ji}.$$



(a) When $k = 2$ and a noninformative prior,

$$\pi(\lambda_1,\lambda_2,p_{11},p_{22}) = \frac{1}{\lambda_1}\, \mathbb{I}_{\lambda_2<\lambda_1},$$

is used, construct the Gibbs sampler for this model. (Hint: Use the hidden
Markov chain $(Z_t)$.)

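(b) For a prior distribution ($j = 1, \ldots, k$)

$$\lambda_j \sim \mathcal{G}a(\alpha_j,\beta_j),$$

show that the parameter simulation step in the Gibbs sampler can be chosen
as the generation of

$$\lambda_j \sim \mathcal{G}a(\alpha_j + n_j\bar{x}_j,\ \beta_j + n_j)$$

and

$$p_j \sim \mathcal{D}(\gamma_1 + n_{j1}, \ldots, \gamma_k + n_{jk}),$$

where

$$n_{ji} = \sum_{t=2}^{n} \mathbb{I}_{z_t=i}\,\mathbb{I}_{z_{t-1}=j}, \qquad n_j = \sum_{t=1}^{n} \mathbb{I}_{z_t=j}, \qquad \text{and} \qquad n_j\bar{x}_j = \sum_{t=1}^{n} \mathbb{I}_{z_t=j}\, x_t.$$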

(c) Design a reversible jump algorithm (Chapter 11) for the extension of the 
above model to the case when k no- 

(Note: Robert and Titterington (1998) detail the Bayesian estimation of such a
hidden Markov Poisson model via Gibbs sampling.)
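The parameter simulation step described in Problem 14.17(b) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `simulate_parameters`, the coding of the hidden states as integers 0, ..., k-1, and the use of normalized Gamma draws to produce the Dirichlet variates are all choices made here. Note that Python's `random.gammavariate` takes a scale as its second argument, so the rate $\beta_j + n_j$ is inverted.

```python
import random

def simulate_parameters(x, z, k, alpha, beta, gamma, rng=random):
    """One parameter-simulation step given hidden states z (coded 0..k-1)."""
    n_j = [0] * k                        # n_j: number of t with z_t = j
    s_j = [0.0] * k                      # n_j * xbar_j: sum of x_t over z_t = j
    n_ji = [[0] * k for _ in range(k)]   # n_ji: number of transitions from j to i
    for t in range(len(x)):
        n_j[z[t]] += 1
        s_j[z[t]] += x[t]
        if t > 0:
            n_ji[z[t - 1]][z[t]] += 1
    # lambda_j ~ Ga(alpha_j + n_j xbar_j, beta_j + n_j); gammavariate expects a
    # scale, hence the inverse of the rate.
    lam = [rng.gammavariate(alpha[j] + s_j[j], 1.0 / (beta[j] + n_j[j]))
           for j in range(k)]
    # p_j ~ D(gamma_1 + n_j1, ..., gamma_k + n_jk), via normalized Gamma draws.
    p = []
    for j in range(k):
        g = [rng.gammavariate(gamma[i] + n_ji[j][i], 1.0) for i in range(k)]
        total = sum(g)
        p.append([gi / total for gi in g])
    return lam, p, n_j, n_ji
```

A full Gibbs sampler would alternate this step with a simulation of the hidden chain given the parameters, which is not shown here.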

14.18 Show that a semi-Markov chain, as introduced in Note 14.6.3, is a Markov 
chain in the special case where the duration in each state is a geometric random 
variable. 

14.19 Evaluate the degeneracy of a regular random walk on an $n \times n$ grid by the
following simulation experiment: for an increasing sequence of values of $(n,p)$,
give the empirical frequency of random walks of length $p$ that are not
self-avoiding.
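One possible setup for the experiment of Problem 14.19 is sketched below. The problem leaves several details open, so the following are assumptions made here: the walk starts at the centre of the grid, each step moves to a uniformly chosen neighbour that lies inside the grid, and a walk of length $p$ means $p$ steps. The function name `non_self_avoiding_frequency` is hypothetical.

```python
import random

def non_self_avoiding_frequency(n, p, n_walks=2000, seed=0):
    """Empirical frequency of p-step walks on an n x n grid that revisit a site."""
    rng = random.Random(seed)
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    failures = 0
    for _ in range(n_walks):
        pos = (n // 2, n // 2)           # assumed starting point: grid centre
        visited = {pos}
        for _ in range(p):
            # neighbours that remain inside the grid
            options = [(pos[0] + dx, pos[1] + dy) for dx, dy in moves
                       if 0 <= pos[0] + dx < n and 0 <= pos[1] + dy < n]
            pos = rng.choice(options)
            if pos in visited:           # the walk is not self-avoiding
                failures += 1
                break
            visited.add(pos)
    return failures / n_walks

for n, p in [(10, 2), (10, 5), (20, 10), (40, 20)]:
    print(n, p, non_self_avoiding_frequency(n, p))
```

The frequency approaches one quickly as $p$ grows, which is the degeneracy that makes plain rejection impractical for long self-avoiding walks.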

14.20 Establish Lemma 14.8 by showing that, if $\theta > 0$, and if $g_+$ denotes the
distribution of the augmented variable $(X, W)$ after the step, then

$$\int_0^\infty \omega'\, g_+(x',\omega')\, d\omega' = 2(1+\delta)\, c_0 f(x'),$$

where $c_0$ denotes the proportionality constant in (14.14). If $\theta = 0$, show that

$$\int_0^\infty \omega'\, g_+(x',\omega')\, d\omega' = (1+\delta)\, c_0 f(x'),$$

and conclude that (14.14) holds.







14.21 If the initial distribution of $(X,\omega)$ satisfies (14.14), show that the R-moves
preserve the equilibrium equation (14.14). (Hint: If $g_+$ denotes the distribution
of the augmented variable $(X, W)$ after the step, show that

$$\int \omega'\, g_+(x',\omega')\, d\omega' = 2 c_0 f(x').)$$



14.6 Notes 

14.6.1 A Brief History of Particle Systems 

The realization of the possibilities of iterating importance sampling is not new:
in fact, it is about as old as Monte Carlo methods themselves! It can be found
in the molecular simulation literature of the 1950s, as in Hammersley and Morton
(1954), Rosenbluth and Rosenbluth (1955) and Marshall (1965).* Hammersley and
colleagues proposed such a method to simulate a self-avoiding random walk (Madras
and Slade 1993) on a grid, because regular importance sampling and rejection
techniques are hugely inefficient in that setting (Problem 14.19). Although this
early implementation occurred in particle physics, the use of the term “particle”
only dates back to Kitagawa (1996), while Carpenter et al. (1999) coined the term
“particle filter”. In signal processing, early occurrences of a “particle filter”
can be traced back to Handschin and Mayne (1969).



14.6.2 Dynamic Importance Sampling 

This generalization of regular importance sampling, which incorporates the use of
an (importance) weight within an MCMC framework, is due to Wong and Liang
(1997), followed by Liu et al. (2001) and Liang (2002). At each iteration $t$, or for
each index $t$ of a sample, the current state $X_t$ is associated with a weight $\omega_t$ in such
a way that the joint distribution of $(X_t,\omega_t)$, say $g(x,\omega)$, satisfies

(14.14) $$\int_0^\infty \omega\, g(x,\omega)\, d\omega \propto f(x),$$

where $f$ is the target distribution.

Then, for a function $h$ such that $\mathbb{E}_f[|h(X)|] < \infty$, the importance sampling
identity (14.1) generalizes into

$$\mathbb{E}_g[h(X)\omega] / \mathbb{E}_g[\omega] = \mathbb{E}_f[h(X)]$$

and the weighted average

$$\sum_{t=1}^{T} \omega_t h(x_t) \Big/ \sum_{t=1}^{T} \omega_t$$

is a convergent estimator of the integral

$$\mathfrak{I} = \int h(x) f(x)\, dx.$$

* So we find, once more, Hammersley and some of the authors of the Metropolis
et al. (1953) paper at the forefront of a simulation advance!

Obviously, regular importance sampling is a special case where $X_t$ is marginally
distributed from $g$ and, conditional on $x_t$, $\omega$ is deterministic, with $\omega \propto f(x)/g(x)$.
A more general scheme provided by Liang (2002) is based on a transition kernel $K$
as follows.



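Algorithm A.63 -Liang’s Dynamic Importance Sampling-

At iteration $t$, given $(x_t,\omega_t)$,

1 Generate $y \sim K(x_t,y)$ and compute

$$\varrho = \omega_t\, \frac{f(y)K(y,x_t)}{f(x_t)K(x_t,y)}.$$

2 Generate $u \sim \mathcal{U}(0,1)$ and take

$$(x_{t+1},\omega_{t+1}) = \begin{cases} (y, (1+\delta)\varrho/a) & \text{if } u < a, \\ (x_t, (1+\delta)\omega_t/(1-a)) & \text{otherwise,} \end{cases}$$

where $a = \varrho/(\varrho+\theta)$, and $\theta$ and $\delta > 0$ are both either constant
or independent random variables.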
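The two steps of [A.63] can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: the function name `dynamic_is_step` and its arguments (`log_f` for the log target, `propose` for a draw from $K(x,\cdot)$, `log_k` for the log kernel density) are hypothetical, and the accepted-move weight $(1+\delta)\varrho/a$ is the form read from the algorithm above. The sketch assumes $\varrho + \theta > 0$.

```python
import math
import random

def dynamic_is_step(x, w, log_f, propose, log_k, theta=1.0, delta=0.0, rng=random):
    """One iteration of the scheme: returns the new pair (x, w)."""
    y = propose(x, rng)                                   # y ~ K(x, .)
    # rho = w f(y) K(y, x) / (f(x) K(x, y)), computed on the log scale
    rho = w * math.exp(log_f(y) + log_k(y, x) - log_f(x) - log_k(x, y))
    a = rho / (rho + theta)                               # requires rho + theta > 0
    if rng.random() < a:
        return y, (1.0 + delta) * (rho + theta)           # equals (1 + delta) rho / a
    return x, (1.0 + delta) * w * (rho + theta) / theta   # equals (1 + delta) w / (1 - a)
```

With `theta=1.0` and `delta=0.0`, both branches return the weight factor $\varrho + 1$, applied to the proposed or to the current state; with `theta=0.0`, every proposal is accepted with weight $(1+\delta)\varrho$. Writing the weights as multiples of $\varrho + \theta$ avoids dividing by $a$ or $1 - a$ when either is close to zero.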



The intuition behind this construction is that the increases in the weights allow
for moves that would not be allowed by regular Metropolis-Hastings algorithms.
But the drawback of this construction is that the importance weights diverge (in $t$),
as shown in Problem 14.2. (Note that the ratio $\varrho/(\varrho+\theta)$ is Boltzmann’s ratio (7.20).)

In comparison, Liu et al. (2001) use a more Metropolis-like ratio, in the so-called
Q-move alternative, where

$$(x_{t+1},\omega_{t+1}) = \begin{cases} (y, \theta \vee \varrho) & \text{if } u < 1 \wedge \varrho/\theta, \\ (x_t, a\omega_t) & \text{otherwise,} \end{cases}$$

with $a > 1$ either a constant or an independent random variable. In the spirit of
Peskun (1973) (Lemma 10.22), Liang (2002) also establishes an inequality: the
acceptance probability is lower for his algorithm than for a Metropolis-Hastings
move.

Liang (2002) distinguishes between the R-move, where $\delta = 0$ and $\theta = 1$, yielding

$$(x_{t+1},\omega_{t+1}) = \begin{cases} (y, \varrho + 1) & \text{if } u < \varrho/(\varrho+1), \\ (x_t, \omega_t(\varrho + 1)) & \text{otherwise,} \end{cases}$$

and the W-move, where $\theta = 0$, $a = 1$ and

$$(x_{t+1},\omega_{t+1}) = (y,\varrho).$$

Wong and Liang (1997) also mention the M-move, which is essentially the result of
MacEachern et al. (1999): in the above algorithm, when $f$ is the stationary
distribution of the kernel $K$, $\varrho = \omega_t$ and $(x_{t+1},\omega_{t+1}) = (y,\omega_t)$. Note that all moves are
identical when $\theta = 0$.

As one can see in the simple case of the W-move, these schemes are not valid
MCMC transitions in that $f$ is not the marginal distribution of the stationary
distribution associated with the transition.

The general result for Liang’s Algorithm [A.63] is as follows.

Proposition 14.8. If the initial distribution of $(X,\omega)$ satisfies (14.14), one
iteration of [A.63] preserves the equilibrium equation (14.14).



Liu et al. (2001) show that the Q-move is only approximately correctly weighted,
the approximation being related to the condition that the average acceptance
probability

$$\int 1 \wedge \left\{ \omega\, \frac{f(z)K(z,x)}{\theta f(x)K(x,z)} \right\} K(x,z)\, dz$$

be almost independent from $\omega$. See Liu et al. (2001) and Liang (2002) for a detailed
study of the instability of the weights $\omega$ along iterations.



14.6.3 Hidden Markov Models 

Hidden Markov models and their generalizations have enjoyed an extremely
widespread use in applied problems. For example, in Econometrics, regression models
and switching time series are particular cases of (14.3) (see Goldfeld and Quandt
1973, Albert and Chib 1993, McCulloch and Tsay 1994, Shephard 1994, Chib 1996,
Billio and Monfort 1998). Hidden Markov chains also appear in character and speech
processing (Juang and Rabiner 1991), in medicine (Celeux and Clairambault 1992,
Guihenneuc-Jouyaux et al. 1998), in genetics (Churchill 1989, 1995), in engineering
(Cocozza-Thivent and Guédon 1990) and in neural networks (Juang 1984). We refer
the reader to MacDonald and Zucchini (1997) and Cappé et al. (2004) for additional
references on the applications of hidden Markov modeling.

Decoux (1997) proposes one of the first extensions to hidden semi-Markov chains, 
where the observations remain in a given state during a random number of epochs 
following a Poisson distribution and then move to another state (see Problem 14.18). 
Guihenneuc-Jouyaux and Richardson (1996) also propose a Markov chain Monte
Carlo algorithm for the processing of a Markov process on a finite state space which 
corresponds to successive degrees of seropositivity. 

There are many potential applications of the hidden semi-Markov modeling to 
settings where the sojourns in each state are too variable or too long to be modeled 
as a Markov process. For instance, in the modeling of DNA as a sequence of Markov 
chains on {A,C,G,T}, it is possible to have thousands of consecutive bases in the 
same (hidden) state; to model this with a hidden Markov chain is unrealistic as it 
leads to very small transitions with unlikely magnitudes like 10“^®. 

One particular application of this modeling can be found for a specific
neurobiological model called the ion channel, which is a formalized representation of
ion exchanges between neurons as neurotransmission regulators in neurobiology. Ion
channels can be in one of several states, each state corresponding to a given electric
intensity. These intensities are only indirectly observed, via so-called patch clamp
recordings, which are intensity variations. The observables $(y_t)_{1\le t\le T}$ are thus
directed by a hidden Gamma (indicator) process $(x_t)_{1\le t\le T}$,

$$y_t|x_t \sim \mathcal{N}(\mu_{x_t},\sigma^2),$$
with the hidden process such that

$$d_{j+1} = t_{j+1} - t_j \sim \mathcal{G}a(s_i,\lambda_i)$$

if $x_t = i$ for $t_j \le t < t_{j+1}$. References to the ion channel model are Ball et al. (1999),
Hodgson (1999), and Hodgson and Green (1999). Cappé et al. (2004) re-examine this
model via the population Monte Carlo technique.




A 



Probability Distributions 



We recall here the density and the first two moments of most of the distributions
used in this book. An exhaustive review of probability distributions is
provided by Johnson and Kotz (1972), or the more recent Johnson and Hoeting
(2003), Johnson et al. (1994, 1995). The densities are given with respect
to Lebesgue or counting measure depending on the context.

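A.1. Normal Distribution, $\mathcal{N}_p(\theta,\Sigma)$

($\theta \in \mathbb{R}^p$ and $\Sigma$ is a $(p \times p)$ symmetric positive definite matrix.)

$$f(x|\theta,\Sigma) = (\det\Sigma)^{-1/2} (2\pi)^{-p/2} \exp\{-(x-\theta)^t \Sigma^{-1} (x-\theta)/2\}.$$

$\mathbb{E}_{\theta,\Sigma}[X] = \theta$ and $\mathbb{E}_{\theta,\Sigma}[(X-\theta)(X-\theta)^t] = \Sigma$.

When $\Sigma$ is not positive definite, the $\mathcal{N}_p(\theta,\Sigma)$ distribution has no
density with respect to Lebesgue measure on $\mathbb{R}^p$. For $p = 1$, the log-normal
distribution is defined as the distribution of $e^X$ when $X \sim \mathcal{N}(\theta,\sigma^2)$.

A.2. Gamma Distribution, $\mathcal{G}a(\alpha,\beta)$

($\alpha, \beta > 0$.)

$$f(x|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}\, \mathbb{I}_{[0,+\infty)}(x).$$

$\mathbb{E}_{\alpha,\beta}[X] = \alpha/\beta$ and $\mathrm{var}_{\alpha,\beta}(X) = \alpha/\beta^2$.

Particular cases of the Gamma distribution are the Erlang distribution,
$\mathcal{G}a(\alpha,1)$, the exponential distribution $\mathcal{G}a(1,\beta)$ (denoted by $\mathcal{E}xp(\beta)$), and
the chi squared distribution, $\mathcal{G}a(\nu/2,1/2)$ (denoted by $\chi^2_\nu$). (Note also that
the opposite convention is sometimes adopted for the parameter, namely
that $\mathcal{G}a(\alpha,\beta)$ may also be noted as $\mathcal{G}a(\alpha,1/\beta)$. See, e.g., Berger 1985.)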
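The remark on the two conventions matters in practice, since software libraries differ on it. As an illustration (the helper names `gamma_rate` and `mean_and_variance` are ours, not the book's), Python's standard `random.gammavariate(alpha, beta)` treats its second argument as a scale, so a draw from $\mathcal{G}a(\alpha,\beta)$ in the rate convention used here requires passing $1/\beta$:

```python
import random

def gamma_rate(alpha, beta, rng=random):
    """Draw from Ga(alpha, beta), beta being a rate, so that E[X] = alpha / beta."""
    return rng.gammavariate(alpha, 1.0 / beta)   # gammavariate takes a scale

def mean_and_variance(draws):
    """Sample mean and unbiased sample variance."""
    m = sum(draws) / len(draws)
    v = sum((d - m) ** 2 for d in draws) / (len(draws) - 1)
    return m, v

rng = random.Random(42)
m, v = mean_and_variance([gamma_rate(3.0, 2.0, rng) for _ in range(20000)])
print(m, v)   # to be compared with alpha/beta = 1.5 and alpha/beta^2 = 0.75
```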

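A.3. Beta Distribution, $\mathcal{B}e(\alpha,\beta)$

($\alpha, \beta > 0$.)

$$f(x|\alpha,\beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\, \mathbb{I}_{[0,1]}(x),$$

where

$$B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}.$$

$\mathbb{E}_{\alpha,\beta}[X] = \alpha/(\alpha+\beta)$ and $\mathrm{var}_{\alpha,\beta}(X) = \alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$.

The beta distribution can be obtained as the distribution of $Y_1/(Y_1 + Y_2)$
when $Y_1 \sim \mathcal{G}a(\alpha,1)$ and $Y_2 \sim \mathcal{G}a(\beta,1)$.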
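The Gamma representation above gives a direct way to simulate beta variates. The short check below is a sketch with a hypothetical helper name (`beta_from_gammas`); it compares the empirical moments of $Y_1/(Y_1+Y_2)$ with $\alpha/(\alpha+\beta)$ and $\alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$.

```python
import random

def beta_from_gammas(alpha, beta, rng=random):
    """Draw Y1 / (Y1 + Y2) with Y1 ~ Ga(alpha, 1) and Y2 ~ Ga(beta, 1)."""
    y1 = rng.gammavariate(alpha, 1.0)
    y2 = rng.gammavariate(beta, 1.0)
    return y1 / (y1 + y2)

rng = random.Random(7)
draws = [beta_from_gammas(2.0, 5.0, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(mean, var)   # to be compared with 2/7 and 10/392
```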

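A.4. Student’s t Distribution, $\mathcal{T}_p(\nu,\theta,\Sigma)$

($\nu > 0$, $\theta \in \mathbb{R}^p$, and $\Sigma$ is a $(p \times p)$ symmetric positive-definite matrix.)

$$f(x|\nu,\theta,\Sigma) = \frac{\Gamma((\nu+p)/2)/\Gamma(\nu/2)}{(\det\Sigma)^{1/2}(\nu\pi)^{p/2}} \left[ 1 + \frac{(x-\theta)^t \Sigma^{-1} (x-\theta)}{\nu} \right]^{-(\nu+p)/2}.$$

$\mathbb{E}_{\nu,\theta,\Sigma}[X] = \theta$ ($\nu > 1$) and $\mathbb{E}_{\theta,\Sigma}[(X-\theta)(X-\theta)^t] = \nu\Sigma/(\nu-2)$ ($\nu > 2$).
When $p = 1$, a particular case of Student’s t distribution is the Cauchy
distribution, $\mathcal{C}(\theta,\sigma^2)$, which corresponds to $\nu = 1$. Student’s t distribution
$\mathcal{T}_p(\nu,0,I)$ can be derived as the distribution of $X/Z$ when $X \sim \mathcal{N}_p(0,I)$
and $\nu Z^2 \sim \chi^2_\nu$.

A.5. Fisher’s F Distribution, $\mathcal{F}(\nu,\varrho)$

($\nu, \varrho > 0$.)

$$f(x|\nu,\varrho) = \frac{\Gamma((\nu+\varrho)/2)\, \nu^{\nu/2} \varrho^{\varrho/2}}{\Gamma(\nu/2)\Gamma(\varrho/2)}\, \frac{x^{(\nu-2)/2}}{(\varrho+\nu x)^{(\nu+\varrho)/2}}\, \mathbb{I}_{[0,+\infty)}(x).$$

$\mathbb{E}_{\nu,\varrho}[X] = \varrho/(\varrho-2)$ ($\varrho > 2$) and $\mathrm{var}_{\nu,\varrho}(X) = 2\varrho^2(\nu+\varrho-2)/[\nu(\varrho-4)(\varrho-2)^2]$
($\varrho > 4$).

The distribution $\mathcal{F}(p,q)$ is also the distribution of $(X-\theta)^t \Sigma^{-1} (X-\theta)/p$
when $X \sim \mathcal{T}_p(q,\theta,\Sigma)$. Moreover, if $X \sim \mathcal{F}(\nu,\varrho)$, $\nu X/(\varrho+\nu X) \sim
\mathcal{B}e(\nu/2,\varrho/2)$.

A.6. Inverse Gamma Distribution, $\mathcal{IG}(\alpha,\beta)$

($\alpha, \beta > 0$.)

$$f(x|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \frac{e^{-\beta/x}}{x^{\alpha+1}}\, \mathbb{I}_{[0,+\infty)}(x).$$

$\mathbb{E}_{\alpha,\beta}[X] = \beta/(\alpha-1)$ ($\alpha > 1$) and $\mathrm{var}_{\alpha,\beta}(X) = \beta^2/((\alpha-1)^2(\alpha-2))$
($\alpha > 2$).

This distribution is the distribution of $X^{-1}$ when $X \sim \mathcal{G}a(\alpha,\beta)$.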

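A.7. Noncentral Chi Squared Distribution, $\chi^2_p(\lambda)$

($\lambda > 0$.)

$$f(x|\lambda) = \frac{1}{2}\, (x/\lambda)^{(p-2)/4}\, I_{(p-2)/2}(\sqrt{\lambda x})\, e^{-(\lambda+x)/2}\, \mathbb{I}_{[0,+\infty)}(x).$$

$\mathbb{E}_\lambda[X] = p + \lambda$ and $\mathrm{var}_\lambda(X) = 2p + 4\lambda$.

This distribution can be derived as the distribution of $X_1^2 + \cdots + X_p^2$ when
$X_i \sim \mathcal{N}(\theta_i,1)$ and $\theta_1^2 + \cdots + \theta_p^2 = \lambda$.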
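The representation as a sum of squared normal variates lends itself to a quick numerical check of the two moments. The sketch below uses a hypothetical helper `noncentral_chi2` and the arbitrary choice $\theta = (1, 2, 0)$, so that $p = 3$ and $\lambda = 5$; the empirical mean and variance should then be close to $p + \lambda = 8$ and $2p + 4\lambda = 26$.

```python
import random

def noncentral_chi2(thetas, rng=random):
    """Draw X_1^2 + ... + X_p^2 with X_i ~ N(theta_i, 1)."""
    return sum(rng.gauss(th, 1.0) ** 2 for th in thetas)

rng = random.Random(2024)
thetas = [1.0, 2.0, 0.0]                    # p = 3, lambda = 1 + 4 + 0 = 5
draws = [noncentral_chi2(thetas, rng) for _ in range(40000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(mean, var)   # to be compared with p + lambda = 8 and 2p + 4 lambda = 26
```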
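A.8. Dirichlet Distribution, $\mathcal{D}_k(\alpha_1,\ldots,\alpha_k)$

($\alpha_1, \ldots, \alpha_k > 0$ and $\alpha_0 = \alpha_1 + \cdots + \alpha_k$.)

$$f(x|\alpha_1,\ldots,\alpha_k) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_k)}\, x_1^{\alpha_1-1} \cdots x_k^{\alpha_k-1}\, \mathbb{I}_{\{\sum_i x_i = 1\}}.$$

$\mathbb{E}_\alpha[X_i] = \alpha_i/\alpha_0$, $\mathrm{var}(X_i) = (\alpha_0-\alpha_i)\alpha_i/[\alpha_0^2(\alpha_0+1)]$ and $\mathrm{cov}(X_i,X_j) =
-\alpha_i\alpha_j/[\alpha_0^2(\alpha_0+1)]$ ($i \ne j$).

As a particular case, note that $(X, 1-X) \sim \mathcal{D}_2(\alpha_1,\alpha_2)$ is equivalent to
$X \sim \mathcal{B}e(\alpha_1,\alpha_2)$.

A.9. Pareto Distribution, $\mathcal{P}a(\alpha,x_0)$

($\alpha > 0$ and $x_0 > 0$.)

$$f(x|\alpha,x_0) = \alpha\, \frac{x_0^\alpha}{x^{\alpha+1}}\, \mathbb{I}_{[x_0,+\infty)}(x).$$

$\mathbb{E}_{\alpha,x_0}[X] = \alpha x_0/(\alpha-1)$ ($\alpha > 1$) and $\mathrm{var}_{\alpha,x_0}(X) = \alpha x_0^2/[(\alpha-1)^2(\alpha-2)]$
($\alpha > 2$).

A.10. Binomial Distribution, $\mathcal{B}(n,p)$

($0 \le p \le 1$.)

$$f(x|p) = \binom{n}{x} p^x (1-p)^{n-x}\, \mathbb{I}_{\{0,\ldots,n\}}(x).$$

$\mathbb{E}_p(X) = np$ and $\mathrm{var}(X) = np(1-p)$.

A.11. Multinomial Distribution, $\mathcal{M}_k(n;p_1,\ldots,p_k)$

($p_i \ge 0$ ($1 \le i \le k$) and $\sum_i p_i = 1$.)

$$f(x_1,\ldots,x_k|p_1,\ldots,p_k) = \binom{n}{x_1 \cdots x_k} \prod_{i=1}^{k} p_i^{x_i}\, \mathbb{I}_{\{\sum_i x_i = n\}}.$$

$\mathbb{E}_p(X_i) = np_i$, $\mathrm{var}(X_i) = np_i(1-p_i)$, and $\mathrm{cov}(X_i,X_j) = -np_ip_j$ ($i \ne j$).
Note that, if $X \sim \mathcal{M}_k(n;p_1,\ldots,p_k)$, $X_i \sim \mathcal{B}(n,p_i)$, and that the binomial
distribution $X \sim \mathcal{B}(n,p)$ corresponds to $(X, n-X) \sim \mathcal{M}_2(n;p,1-p)$.

A.12. Poisson Distribution, $\mathcal{P}(\lambda)$

($\lambda > 0$.)

$$f(x|\lambda) = e^{-\lambda}\, \frac{\lambda^x}{x!}\, \mathbb{I}_{\mathbb{N}}(x).$$

$\mathbb{E}_\lambda[X] = \lambda$ and $\mathrm{var}_\lambda(X) = \lambda$.

A.13. Negative Binomial Distribution, $\mathcal{N}eg(n,p)$

($0 < p < 1$.)

$$f(x|p) = \binom{n+x-1}{x} p^n (1-p)^x\, \mathbb{I}_{\mathbb{N}}(x).$$

$\mathbb{E}_p[X] = n(1-p)/p$ and $\mathrm{var}_p(X) = n(1-p)/p^2$.

A.14. Hypergeometric Distribution, $\mathcal{H}yp(N;n;p)$

($0 \le p \le 1$, $n < N$ and $pN \in \mathbb{N}$.)

$$f(x|p) = \frac{\binom{pN}{x} \binom{(1-p)N}{n-x}}{\binom{N}{n}}\, \mathbb{I}_{\{n-(1-p)N,\ldots,pN\}}(x)\, \mathbb{I}_{\{0,1,\ldots,n\}}(x).$$

$\mathbb{E}_{N,n,p}[X] = np$ and $\mathrm{var}_{N,n,p}(X) = (N-n)np(1-p)/(N-1)$.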




B 

Notation 




B.1 Mathematical

$\mathbf{h} = (h_1,\ldots,h_n) = \{h_i\}$    boldface signifies a vector
$H = \{h_{ij}\} = \|h_{ij}\|$    uppercase signifies a matrix
$I$, $\mathbf{1}$, $J = \mathbf{1}\mathbf{1}'$    identity matrix, vector of ones, a matrix of ones
$A \prec B$    $(B - A)$ is a positive definite matrix
$|A|$    determinant of the matrix $A$
$\mathrm{tr}(A)$    trace of the matrix $A$
$a^+$    $\max(a, 0)$
$C_n^p$, $\binom{n}{p}$    binomial coefficient
$D_\alpha$    logistic function
${}_1F_1(a;b;z)$    confluent hypergeometric function
$F^-$    generalized inverse of $F$
$\Gamma(x)$    gamma function ($x > 0$)
$\Psi(x)$    digamma function, $(d/dx)\log\Gamma(x)$ ($x > 0$)
$\mathbb{I}_A(t)$    indicator function (1 if $t \in A$, 0 otherwise)
$I_\nu(z)$    modified Bessel function ($z > 0$)
$\binom{n}{p_1 \cdots p_n}$    multinomial coefficient
$\nabla f(z)$    gradient of $f(z)$, the vector with coefficients $(\partial/\partial z_i) f(z)$ ($f(z) \in \mathbb{R}$ and $z \in \mathbb{R}^p$)
$\nabla^t f(z)$    divergence of $f(z)$, $\sum (\partial/\partial z_i) f_i(z)$ ($f(z) \in \mathbb{R}^p$ and $z \in \mathbb{R}^p$)
$\Delta f(z)$    Laplacian of $f(z)$, $\sum (\partial^2/\partial z_i^2) f(z)$
$\|\cdot\|_{TV}$    total variation norm
$\|x\|$    Euclidean norm
$[x]$ or $\lfloor x \rfloor$    greatest integer less than $x$
$\lceil x \rceil$    smallest integer larger than $x$
$f(t) \propto g(t)$    the functions $f$ and $g$ are proportional
$\mathrm{supp}(f)$    support of $f$
$\langle x, y \rangle$    scalar product of $x$ and $y$ in $\mathbb{R}^p$
$x \vee y$    maximum of $x$ and $y$
$x \wedge y$    minimum of $x$ and $y$


B.2 Probability

$X$, $Y$    random variable (uppercase)
$(\mathcal{X}, P, \mathcal{B})$    probability triple: sample space, probability distribution, and $\sigma$-algebra of sets
$\beta_n$    $\beta$-mixing coefficient
$\delta_{\theta_0}(\theta)$    Dirac mass at $\theta_0$
$E(\theta)$    energy function of a Gibbs distribution
$\mathcal{E}(\pi)$    entropy of the distribution $\pi$
$F(x|\theta)$    cumulative distribution function of $X$, conditional on the parameter $\theta$
$f(x|\theta)$    density of $X$, conditional on the parameter $\theta$, with respect to Lebesgue or counting measure
$X \sim f(x|\theta)$    $X$ is distributed with density $f(x|\theta)$
$\mathbb{E}_\theta[g(X)]$    expectation of $g(x)$ under the distribution $X \sim f(x|\theta)$
$\mathbb{E}^V[h(V)]$    expectation of $h(v)$ under the distribution of $V$
$\mathbb{E}^\pi[h(\theta)|x]$    expectation of $h(\theta)$ under the distribution of $\theta$, conditional on $x$, $\pi(\theta|x)$
iid    independent and identically distributed
$\lambda(dx)$    Lebesgue measure, also denoted by $d\lambda(x)$
$P_\theta$    probability distribution, indexed by the parameter $\theta$
$p * q$    convolution product of the distributions $p$ and $q$, that is, distribution of the sum of $X \sim p$ and $Y \sim q$
$p^{*n}$    convolution $n$th power, that is, distribution of the sum of $n$ iid rv’s distributed from $p$
$\varphi(t)$    density of the Normal distribution $\mathcal{N}(0,1)$
$\Phi(t)$    cumulative distribution function of the Normal distribution $\mathcal{N}(0,1)$
$O(n)$, $o(n)$ or $O_p(n)$, $o_p(n)$    big “Oh”, little “oh.” As $n \to \infty$, $O(n)/n \to$ constant, $o(n)/n \to 0$, and the subscript $p$ denotes in probability



B.3 Distributions

$\mathcal{B}(n,p)$    binomial distribution
$\mathcal{B}e(\alpha,\beta)$    beta distribution
$\mathcal{C}(\theta,\sigma^2)$    Cauchy distribution
$\mathcal{D}_k(\alpha_1,\ldots,\alpha_k)$    Dirichlet distribution
$\mathcal{E}xp(\lambda)$    exponential distribution
$\mathcal{F}(p,q)$    Fisher’s F distribution
$\mathcal{G}a(\alpha,\beta)$    gamma distribution
$\mathcal{IG}(\alpha,\beta)$    inverse gamma distribution
$\chi^2_p$, $\chi^2_p(\lambda)$    chi squared distribution, noncentral chi squared distribution with noncentrality parameter $\lambda$
$\mathcal{M}_k(n;p_1,\ldots,p_k)$    multinomial distribution
$\mathcal{N}(\theta,\sigma^2)$    univariate normal distribution
$\mathcal{N}_p(\theta,\Sigma)$    multivariate normal distribution
$\mathcal{N}eg(n,p)$    negative binomial distribution
$\mathcal{P}(\lambda)$    Poisson distribution
$\mathcal{P}a(x_0,\alpha)$    Pareto distribution
$\mathcal{T}_p(\nu,\theta,\Sigma)$    multivariate Student’s t distribution
$\mathcal{U}_{[a,b]}$    continuous uniform distribution
$\mathcal{W}e(\alpha,c)$    Weibull distribution
    Wishart distribution

B.4 Markov Chains

$\alpha$    atom
$AR(p)$    autoregressive process of order $p$
$ARMA(p,q)$    autoregressive moving average process of order $(p,q)$
$C$    small set
$d(\alpha)$    period of the state or atom $\alpha$
$\dagger$    “dagger,” absorbing state
$\Delta V(x)$    drift of $V$
$\mathbb{E}_\mu[h(X_n)]$    expectation associated with $P_\mu$
$\mathbb{E}_{x_0}[h(X_n)]$    expectation associated with $P_{x_0}$
$\eta_A$    total number of passages in $A$
$Q(x,A)$    probability that $\eta_A$ is infinite, starting from $x$
$\gamma_g^2$    variance of $S_N(g)$ for the Central Limit Theorem
$K_\varepsilon$    kernel of the resolvant
$MA(q)$    moving average process of order $q$
$L(x,A)$    probability of return to $A$ starting from $x$
$\nu$    minorizing measure for an atom or small set
$P(x,A)$    transition kernel
$P^m(x,A)$    transition kernel of the chain $(X_{mn})_n$
$P_\mu(\cdot)$    probability distribution of the chain $(X_n)$ with initial state $X_0 \sim \mu$
$P_{x_0}(\cdot)$    probability distribution of the chain $(X_n)$ with initial state $X_0 = x_0$
$\pi$    invariant measure
$S_N(g)$    empirical average of $g(x_i)$ for $1 \le i \le N$
    coupling time for the initial distributions $p$ and $q$
$\tau_A$    return time to $A$
$\tau_A(k)$    $k$th return time to $A$
$U(x,A)$    average number of passages in $A$, starting from $x$
$X_n$    generic element of a Markov chain
$\check{X}_n$    augmented or split chain

B.5 Statistics

$x$, $y$    realized values (lowercase) of the random variables $X$ and $Y$ (uppercase)
$\mathcal{X}$, $\mathcal{Y}$    sample space (uppercase script Roman letters)
$\theta$, $\lambda$    parameters (lowercase Greek letters)
$\Theta$, $\Lambda$    parameter space (uppercase script Greek letters)
$B^\pi(x)$    Bayes factor
$\delta^{JS}(x)$    James-Stein estimator
$\delta^\pi(x)$    Bayes estimator
$\delta^+(x)$    positive-part James-Stein estimator
$H_0$    null hypothesis
$I(\theta)$    Fisher information
$L(\theta,\delta)$    loss function, loss of estimating $\theta$ with $\delta$
$L(\theta|x)$    likelihood function, a function of $\theta$ for fixed $x$, mathematically identical to $f(x|\theta)$
$\ell(\theta|x)$    logarithm of the likelihood function
$L^P(\theta|x)$, $\ell^P(\theta|x)$    profile likelihood
$m(x)$    marginal density
$\pi(\theta)$    generic prior density for $\theta$
$\pi^J(\theta)$    Jeffreys prior density for $\theta$
$\pi(\theta|x)$    generic posterior density for $\theta$
$\bar{x}$    sample mean
$s^2$    sample variance
$X^*$, $Y^*$, $x^*$, $y^*$    latent or missing variables (data)

B.6 Algorithms

$[A_n]$    symbol of the $n$th algorithm
$B$    backward operator
$B_T$    inter-chain variance after $T$ iterations
$W_T$    intra-chain variance after $T$ iterations
$D_T^i$    cumulative sum to $i$, for $T$ iterations
$\delta_{rb}$, $\delta^{RB}$    Rao-Blackwellized estimators
$F$    forward operator
$g_i(x_i|x_j, j \ne i)$    conditional density for Gibbs sampling


mo 


regeneration probability 


$K(x,y)$    transition kernel
$\tilde{K}$    transition kernel for a mixture algorithm
$K^*$    transition kernel for a cycle of algorithms
$\lambda_k$    duration of excursion
$q(x|y)$    transition kernel, typically used for an instrumental variable
$\varrho(x,y)$    acceptance probability for a Metropolis-Hastings algorithm
$S_T$    empirical average
    conditional version of the empirical average
    recycled version of the empirical average
    importance sampling version of the empirical average
    Riemann version of the empirical average
$S_h(\omega)$    spectral density of the function $h$
Harris recurrence, 207, 222, 233,
240-242, 259, 379 
importance of, 222, 223 
Hessian matrix, 319 
Hewitt-Savage 0-1 Law, 240 
hidden 

Markov model, 352, 571, 575, 576, 
579-580 
Poisson, 576 

semi-Markov chains, 579
How long is long enough?, 512
hybrid strategy, 378 

identifiability, 89, 181, 398-400 
constraint, 398-400 
image processing, 168, 419 
implicit equations, 79 
importance, 90 
importance function, 97 
importance sampling, 91, 92, 203, 268 
accuracy, 94 

and Accept-Reject, 93-96, 102, 
103 

and infinite variance, 103, 488 
and MCMC, 271, 545-547, 557 
and particle filters, 547 
and population Monte Carlo, 562 
bias, 95, 133 

by defensive mixtures, 103, 562, 
563 

difficulties, 102 
dynamic, 577 
efficiency, 96 

for convergence assessment, 484 
identity, 92, 546, 577 
implementation, 102 
improvement, 134 
monitoring, 153 
optimum, 95 



principle, 90 
sequential, 549 
variance, 126 
IMSL, 22 

independence, 462, 463 
inference 

asymptotic, 83 
Bayesian, 5, 12-14, 269 
noninformative, 368
difficulties of, 5
empirical Bayes, 82
generalized Bayes, 404
nonparametric, 5
statistical, 1, 7, 79, 80 
information 

Fisher, 31, 404 

Kullback-Leibler, 31, 191, 552, 560 
prior, 14, 383, 406 
Shannon, 32 

informative censoring, 23 
initial condition, 293, 380 

dependence on, 292, 379, 462 
initial distribution, 211, 274 
and parallel chains, 464 
influence of, 379, 464, 499 
initial state, 208 
integration, 6, 19, 79, 85, 140 
approximative, 83, 107, 135 
by Riemann sums, 22, 134, 475 
Monte Carlo, 83, 84 
numerical, 80, 135 
bounds, 22, 83 
problems, 12 
recursive, 172, 365 
weighted Monte Carlo, 135 
interleaving, 349, 462 
and reversibility, 349 
property, 349, 355, 356 
intrinsic losses, 80 
inversion 

of Gibbs steps, 478 
of the cdf, 39, 40 

ion channel, 579 

irreducibility, 206, 213, 213, 215, 218, 
274, 284 

Ising model, 37, 168, 188, 326, 373, 408 

Jacobian, 431, 432 
jump 
mode, 295
reversible, 430 
stochastic gradient, 163 

Kac’s representation, see mixture 
representation 
$K_\varepsilon$-chain, 211, 213, 220
kernel 

estimation, 403 
residual, 229, 493, 524 
reverse, 533 

transition, 206, 208, 209 
Kolmogorov-Smirnov test, see test 
Kuiper test, see test 
Kullback-Leibler information, see 
information 

L’Hospital’s rule, 330
label switching, 540 
Lagrangian, 19 
Langevin algorithm, 319 

and geometric ergodicity, 319 
extensions of, 320 
Langevin diffusion, 316, 318 
Laplace approximation, 107, 115, 188 
large deviations, 37, 90, 119 
Latin hypercubes, 156 
Law of Large Numbers, 83, 125, 239, 
268, 494, 551, 558 
strong, 83, 242 
lemma 

Pitman-Koopman, 10, 30 
Levenberg-Marquardt algorithm, see 
algorithm 

Liapounov condition, 245 
likelihood, 12 

function, 6, 7 
individual, 326 
integral, 12 
maximum, 82, 83, 172 
multimodal, 11 
profile, 9, 18, 80 
pseudo-, 7 
unbounded, 11 

likelihood ratio approximation, 369 
linear model, 27 
local maxima, 160, 388 
log-concavity, 72, 289 
longitudinal data, 3, 454 



loss 

intrinsic, 80 
posterior, 13 
quadratic, 13, 81 
low-discrepancy sequence, 76 
lozenge, 522 

Manhattan Project, 314 
MANIAC, 314 
Maple, 22 

marginalization, 103, 551 

and Rao-Blackwellization, 354 
Chib’s, 451 
Monte Carlo, 114 

Markov chain, 99, 101, 102, 205, 208 , 
212, 378 
augmented, 217 
average behavior, 238 
divergent, 406 
domination, 394, 578 
ergodic, 268 
essentials, 206 
Harris positive, see Harris 
positivity 

Harris recurrent, see Harris 
recurrence 
hidden, 579 

homogeneous, 162, 209, 286, 501 

instrumental, 349 

interleaved, 349 

irreducible, 213 

limit theorem for, 238, 263 

lumpable, 256 

m-skeleton, 207 

nonhomogeneous, 162, 286, 300, 
316, 317 
observed, 239 
positive, 224 

random mapping representation, 
513 

recurrent, 220, 259 
reversible, 244, 261 
semi-, 579 
for simulation, 268 
split, 217 
stability, 219, 223 
strongly aperiodic, 218, 220 
strongly irreducible, 213, 345 
transient, 258 
two state, 500
$\psi$-irreducible, 213
weakly mixing, 482 

Markov chain Monte Carlo method, see 
MCMC algorithm 
Markov property 

strong, 212, 247 
weak, 211 
mastitis, 383 

Mathematic a, 22, 4l7lf2-^ 

Matlab, 22, 41 
matrix 

regular, 254 

transition, 208, 211, 217, 224 
tridiagonal, 129 
maxima 

global, 167, 169 
local, 162, 163, 202 
maximization, 6 

maximum likelihood, 5, 7, 10, 83 
constrained, 8, 173 
difficulties of, 10, 12 
estimation, 6, 8 
existence of, 11 
justification, 6 

MCEM algorithm, 183, 185, 340 
standard error, 186 
MCMC algorithm, 213, 236, 244, 268, 

268 

and Monte Carlo methods, 269 
birth-and-death, 446 
calibration, 317 
convergence of, 459 
heterogeneous, 300, 390 
history, 163, 314 
implementation, 269 
monitoring of, 459, 474, 491 
motivation, 268 
measure 

counting, 226 
invariant, 223, 225 
Lebesgue, 214, 225 
maximal irreducibility, 213 
method 

Accept-Reject, 47, 51-53 
delta, 126 
gradient, 162 
kernel, 508 

least squares, 7, 173, 201 



Markov chain Monte Carlo, see 
MCMC algorithm 
of moments, 8 
monitoring, 406 
Monte Carlo, see Monte Carlo 
nonparametric, 508, 563 
numerical, 1, 19, 22, 80, 85, 135, 
157, 158, 497 
quasi-Monte Carlo, 21, 75 
simulated annealing, 163, 167, 200, 
203 

simulation, 19 

Metropolis-Hastings algorithm, 165, 
167, 267, 269, 270, 270 , 378, 
403, 481 

and Accept-Reject, 279 
and importance sampling, 271 
autoregressive, 291 
classification of, 276 
drawbacks of, 388 
efficiency of, 321 
ergodicity of, 289 
independent, 276, 279, 296, 391, 
472 

irreducibility of, 273 
random walk, 284, 287, 288 
symmetric, 473 
validity of, 270 
Metropolization, 394 
mining accidents, 455 
missing data, 2, 5, 174, 340, 346, 374 
simulation of, 200 
missing mass, 475 
mixing, 489 

condition, 263 
speed, 462 

mixture, 4, 295, 366, 368 

computational difficulties with, 4, 
11 

continuous, 45, 420 
defensive, 103, 562 
dual structure, 351 
exponential, 23, 190, 419 
geometric, 420 
identifiability, 89 
indicator, 341 
of kernels, 389 
negative weight, 493 
nonparametric, 420 
Normal, 11, 88, 365, 388, 398, 399,
506 

EM algorithm, 181 
number of components 
test on, 88 
Poisson, 23 

reparameterization of, 397 
representation, 45-46, 77, 229, 375, 
471, 524 

residual decomposition, 252 
for simulation, 45, 67, 77, 103, 131, 
373 

stabilization by, 103, 113 
Student’s t, 497 
two-stage Gibbs for, 340 
unbounded likelihood, 11 
with unknown number of 
components, 426, 437 
mode, 160, 482, 499 
global, 172 

local, 10, 22, 172, 201 
attraction towards, 388 
trapping effect of, 390 

model 

ANOVA, 416 

AR, 194, 210, 227, 244, 260, 265, 
428 

ARCH, 308, 365, 368 
ARMA, 508 
augmented, 183 
autoexponential, 372, 382 
autoregressive, 28 
averaging, 427, 433, 440 
Bernoulli-Laplace, 208, 227 
capture-recapture, see capture- 
recapture 
censored, 327 
change-point, 454 
choice, 125 
completed, 341 
decomposable, 423 
embedded, 433 
generalized linear, 287 
graphical, 422 

hidden Markov, see hidden 
Markov, 549 
hierarchical, 383 
Ising, see Ising 
linear calibration, 375 



logistic, 145 
logit, 15, 168 
MA, 4 

mixed-effects, 384 
multilayer, 201 
multinomial, 347, 394 
Normal, 404 
hierarchical, 27 
logistic, 58 

overparameterized, 295 
Potts, 312 
probit, 391 

EM algorithm for, 192 
random effect, 414 
random effects, 397, 406 
Rasch, 415 

saturation, 432, 444, 450, 451 
stochastic volatility, 549 
switching AR, 453 
tobit, 410 

variable dimension, 425-427 
modeling, 1, 367 

and reduction, 5 
moment generating function, 8 
monitoring of Markov chain Monte 

Carlo algorithms, see MCMC 
monotone likelihood ratio (MLR), 72 
monotonicity 

of covariance, 354, 361, 463 
for perfect sampling, 518 
Monte Carlo, 85, 90, 92, 96, 141, 153 
approximation, 203 
EM, 183, 184 
maximization, 204 
optimization, 158 
population, see population 
move 

birth, 437 
death, 437 
deterministic, 432 
merge, 433, 438 
Q,R,M, 578 
split, 433, 438 
multimodality, 22 
multiple chains, 464 

The Name of the Rose, 458 
neural network, 201, 579 
neuron, 579 
Newton-Raphson algorithm, see
algorithm 
nodes, 422 
non-stationarity, 466 
Normal 

approximation, 85, 283 
variate generation, 40, 43, 69 
normality 

asymptotic, 86 
violation, 130 
normalization, 485 
normalizing constant, 14, 16, 17, 50, 
137, 147, 191, 271, 367, 405, 
431, 471, 479 

and Rao-Blackwellization, 309 
approximation, 95 
evaluation, 479 

Northern Pintail ducks, 59 
nuclear pump failures, 386, 470, 474 
null recurrence and stability, 242 
Nummelin’s splitting, 216, 225 

Oakes’ identity, 186 
Occam’s Razor, 458 
on-line monitoring, 482 
optimality, 316 

of an algorithm, 41, 58, 91, 95, 
137, 374 

of an estimator, 14, 82 
of the Metropolis-Hastings 
algorithm, 272 

optimization, 19, 52, 79, 293, 294, 313, 
315 

exploratory methods, 162 
Monte Carlo, 160 
Newton-Raphson algorithm, 20 
simulated annealing, 166 
stochastic, 268 
order statistics, 44, 65 
ordering 

natural, 526 
stochastic, 518 
Orey’s inequality, 253, 304 
O-ring, 15
outlier, 384 
overdispersion, 383 
overfitting, 428 
overparameterization, 398 



Pharmacokinetics, 384
paradox of trapping states, 368 
parallel chains, 462, 479 
parameter 

of interest, 9, 18, 25, 31 
location-scale, 398 
natural, 8 
noncentrality, 121 
nuisance, 9, 80, 89 
parameterization, 377, 388, 399, 408 
cascade, 398, 419 
particle, 545, 547, 552 
particle filter, see filter 
auxiliary, 554 
partition product, 456 
partitioning, 155 
patch clamp recordings, 579 
path, 461 

properties, 479 
simulation, 554 
single, 269 
perceptron, 458 

perfect simulation, 239, 471, 511, 512 
and renewal, 474 
first occurrence, 512 
performances 

of congruential generators, 73 
of estimators, 81 

of the Gibbs sampler, 367, 388, 396 
of importance sampling, 96 
of integer generators, 72 
of the Langevin diffusion, 320 
of the Metropolis-Hastings 
algorithm, 292, 295, 317 
of Monte Carlo estimates, 96 
of renewal control, 496 
of simulated annealing, 169 
period, 72, 73, 218 
Peskun’s ordering, 394, 578 
Physics, 37 
pine tree, 452 

Pluralitas non est ponenda sine
necessitate, 458
Poisson variate generation, 55 
polar coordinates, 111
population Monte Carlo, 560 
positivity, 225, 273, 344, 377 
condition, 345 
potential function, 258, 291 
power of a test, 12
principle 

duality, 351, 354, 367, 495, 512, 
524 

likelihood, 239 
MCMC, 267 
parsimony, 458 
squeeze, 53, 54 

prior 

conjugate, 13, 30, 82 
Dirichlet, 424 
feedback, 169 
hierarchical, 383 
improper, 399, 403, 405 
influence of, 171 
information, 30 
Jeffreys, 28 

noninformative, 31, 383 
pseudo, 452 
reference, 14, 31 

for the calibration model, 18 
robust, 124 
probability 

density function (pdf) 
monotone, 72 
unimodal, 72 
integral transform, 39 
Metropolis-Hastings acceptance, 
271 

of acceptance, see acceptance 
probability 
of regeneration, 473 
problem 

Behrens-Fisher, 12 
Fieller’s, 18 
procedure 

doubling, 336 
stepping-out, 336 
process 

birth-and-death, 446 
branching, 219 
forward recurrence, 253 
jump, 446 

Langevin diffusion, see Langevin 
Poisson, 43, 66, 386, 454 
renewal, 232 
programming, 41 
parallel, 153 
propagation, 571 



properties 

mixing, 512 
sample path, 239 
pseudo-likelihood, 7
pseudo-prior, 452 

quality control, 3 
quantile, 500 

R, 22, 420, 509 
random effects, 406 
random mapping, 513 
random walk, 206 , 225, 226, 250, 257, 
284, 287, 290, 292, 295, 315, 
317 

with drift, 319 
multiplicative, 249 
on a subgraph, 322 
randomness, 36 
range of proposal, 295 
Rao-Blackwellization, 130, 133, 296, 
402-403 

and importance, 133 
and monotonicity, 354, 361 
argument, 351 

for continuous time jump processes, 
447 

convergence assessment, 484, 486, 
496 

of densities, 403 

for population Monte Carlo, 564 
implementation, 131 
improvement, 297, 356, 357 
for mixtures, 367 
nonparametric, 295 
parametric, 490 
weight, 298, 310 

rate 

acceptance, 286, 292, 293, 295, 
316, 317, 373, 382, 401 
mixing, 464 
regeneration, 471 
renewal, 494 
rats carcinoma, 24 
raw sequence plot, 483 
recurrence, 207, 219, 222, 225, 259 
and admissibility, 262 
and positivity, 223 
Harris, see Harris recurrence 
null, 223 

positive, 223, 406 
recursion equation, 421 
recycling, 354, 485, 499 
reference prior, see prior 
reflecting boundary, 291 
regeneration, 124, 246, 302, 471, 473 
regime, 549 
region 

confidence, 16 

highest posterior density, 16, 18 
regression 

isotonic, 25, 172 
linear, 7, 27, 28, 498 
logistic, 15, 146 
Poisson, 59 
qualitative, 15 

relativity of convergence assessments, 
465 

reliability, 23 

of an estimation, 96 
renewal, 261, 470 

and regeneration, 471 
control, 495 
probabilities, 473, 493 
rate, 494 

theory, 215, 261, 470 
time, 217, 229, 491 
Renyi representation, 65 
reparameterization, 159, 367, 399 
resampling, 32 

systematic, 555 
unbiased, 114 
resolvant, 211 
chain, 251 

reversibility, 229, 244, 429 
reversible jump, 540 
Riemann integration, 21 
Riemann sum 

and Rao-Blackwellization, 410 
control variate, 474, 483 
Robbins-Monro, 201-203 
conditions, 202 
robustness, 10, 90 
running mean plot, 127, 128 

saddlepoint, 120, 162 

approximation, 174, 282 
sample 



independent, 140 
preliminary, 502 
test, 501 
uniform, 141 
sampling 

batch, 356 
comb, 574 

Gibbs, see Gibbs sampling 
importance, 299, 497 
parallel, 546 
residual, 229, 574 
stratified, 155 
sandwiching argument, 520 
saturation, see model saturation 
scale, 163 
scaling, 443 
seeds, 75, 515 
semi-Markov chain, 576 
sensitivity, 90, 92 
separators, 423 
set 

Kendall, 260 
recurrent, 220 

small, 215, 218, 221, 235, 259, 274, 
470, 471, 490, 495 
transient, 249 

uniformly transient, 220, 249 

shift 

left, 74 
right, 74 

shuttle Challenger, 15, 281 

signal processing, 158 
Simpson’s rule, 21 

simulated annealing, 23, 209, 287, 315 
acceleration, 160 
and EM algorithm, 178 
and prior feedback, 171 
and tempering, 540 
principle, 163 

simulation, 22, 80, 83, 268, 269, 403 
in parallel, 124, 464 
motivation for, 1, 4, 5, 80 
numerical, 462 
parallel, 464 
path, 556 

philosophical paradoxes of, 36, 37 
recycling of, 295 
Riemann sum, 135-139 
twisted, 119 
univariate, 372 

versus numerical methods, 19, 21, 
22, 40, 85, 157 
waste, 269, 295 

size 

convergence, 501 
warm-up, 501 
slice, 335 

slice sampler, 176, 321, 322, 326 
convergence of, 329 
drift condition, 330 
genesis, 335 

geometric ergodicity of, 332 
polar, 332, 365 
poor performance of, 332 
relationship to Gibbs sampling, 
337, 343 
uniform, 364 

uniform ergodicity of, 329 
univariate, 526 

small dimension problems, 22 
smoothing 

backward, 572 
spatial Statistics, 168 
spectral analysis, 508 
speed 

convergence, 295, 388, 393, 402, 
489 

for empirical means, 479 
mixing, 462, 482 
splitting, 219 
S-Plus, 41, 509 
squared error loss, 108 
stability, 398, 488 
of a path, 96 
stabilizing, 495 
start-up table, 73 
state, 549 

absorbing, 368 
initial, 209 
period, 217 
recurrent, 219 
transient, 219 
trapping, 368, 399 
state-space, 549 

continuous, 210 
discrete, 224, 313, 394 
finite, 168, 206, 227, 284, 352, 495, 
512, 579 



stationarity, 239, 483 
stationarization, 461 
stationary distribution, 206, 223, 512 
and diffusions, 206 
as limiting distribution, 207 
statistic 

order, 135, 151 
sufficient, 9, 10, 100, 130 
Statistical Science, 33 
Statlib, 502 
stochastic 

approximation, 174, 201 
differential equation, 318 
exploration, 159 
gradient, 287 
equation, 318 
monotonicity, 518 
optimization, 268, 271 
recursive sequence, 514, 536 
restoration, 340 
volatility, 549 

stopping rule, 212, 281, 465, 491, 497, 
499, 502 
stopping time, 516 
Student’s t 

approximation, 283 
posterior, 300 
variate generation, 46, 65 
subgraph, 321 
subsampling, 462, 500 

and convergence assessment, 463 
and independence, 468 
support 

connected, 275 
restricted, 97, 101 
unbounded, 49 
sweeping, 397 
switching models, 579 

tail area, 122, 282 

approximation, 91, 112, 114, 122 
tail event, 240 

tail probability estimation, 93 
temperature, 163, 167 
decrease rate of, 167 
tempering, 540-543, 558 
power, 540 

termwise Rao-Blackwellized estimator, 
151 
test 

χ², 508 
Die Hard, 37 

Kolmogorov-Smirnov, 37, 76, 466, 
509 

Kuiper, 466 
likelihood ratio, 86 
for Markov chains, 500 
nonparametric, 466 
power of, 86, 90 
randomness, 38 
stationarity, 462 
uniform, 36 
testing, 86 
theorem 

Bayes, 12, 14, 50, 572 
Central Limit, 43, 83, 123, 235-237, 
239, 242-244, 246, 261, 264, 
376, 481, 491, 492, 559 
Cramer, 119 
Cramer-Wold’s, 126 
ergodic, 239-241, 269, 302, 462, 

481 

Fubini, 13 

fundamental, simulation, 47 
Glivenko-Cantelli, 32 
Hammersley-Clifford, 343, 344, 

408, 485 

Kac’s, 224, 229, 471 
Kendall, 236 
Rao-Blackwell, 130, 296 
theory 

Decision, 14, 81, 83 
Neyman-Pearson, 12 
renewal, 470 

There ain’t no free lunch, 558 
time 

computation, 486 
interjump, 317 
renewal, 216, 491 
stopping, 211, 212, 216, 459 
time series, 4, 508 

for convergence assessment, 508 
total variation norm, 207, 231, 236, 253 
training sample, 201 
trajectory (stability of), 218 
transience of a discretized diffusion, 318 
transition, 273 



Metropolis-Hastings, 270 
pseudo-reversible, 472 
transition kernel, 206, 208, 210, 490 
atom, 214 
choice of, 292 
symmetric, 207 

traveling salesman problem, 308 

tree swallows, 196 
turnip greens, 414 

unbiased estimator 
density, 486 
L2 distance, 505 

uniform ergodicity, 229, 261, 525 
uniform random variable, 39, 47 
generation, 35, 36, 39 
unit roots, 453 
universality, 58, 275, 295 

of Metropolis-Hastings algorithms, 
272 

variable 

antithetic, 140, 143 
auxiliary, 3, 340, 374 
latent, 2, 327, 341, 391, 550 
variance 

asymptotic, 491, 496 
between- and within-chains, 497 
finite, 94, 95, 97 
infinite, 102 
of a ratio, 125 

reduction, 90, 91, 96, 130-132, 141 
and Accept-Reject, 143 
and antithetic variables, 151 
and control variates, 145, 147 
optimal, 155 
variate (control), 145 
velocities of galaxies, 426, 439, 450 
vertices, 422 
virtual observation, 171 
volatility, 550 

waiting time, 51 

Wealth is a mixed blessing, 433, 446, 
570 

You’ve only seen where you’ve been, 
464, 470 